FOSDEM 2015: Building High-Performance Language Implementations With Low Effort

Today, I gave a talk on the FOSDEM main track about implementing languages based on the ideas behind RPython and Truffle. Please find the abstract and slides below.

Thanks again also to the FOSDEM organizers for having me! :)

Abstract

This talk shows how languages can be implemented as self-optimizing interpreters, and how Truffle and RPython go about just-in-time compiling these interpreters to efficient native code.

Programming languages are never perfect, so people start building domain-specific languages to be able to solve their problems more easily. However, custom languages are often slow, or take enormous amounts of effort to be made fast by building custom compilers or virtual machines.

With the notion of self-optimizing interpreters, researchers proposed a way to implement languages easily and generate a JIT compiler from a simple interpreter. We explore the idea and experiment with it on top of RPython (of PyPy fame) with its meta-tracing JIT compiler, as well as Truffle, the JVM framework of Oracle Labs for self-optimizing interpreters.

In this talk, we show how a simple interpreter can reach the same order of magnitude of performance as the highly optimizing JVM for Java. We discuss the implementation on top of RPython as well as on top of Java with Truffle so that you can start right away, independent of whether you prefer the Python or JVM ecosystem.

While our own experiments focus on SOM, a little Smalltalk variant to keep things simple, other people have used this approach to improve the peak performance of JRuby, or to build languages such as JavaScript, R, and Python 3.
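To make the notion of a self-optimizing interpreter a bit more concrete, here is a minimal, hand-rolled sketch in plain Java. It is not the Truffle API and not SOM code; all class and method names are made up for illustration. The idea is that an AST node observes the values it sees at run time and rewrites itself into a specialized version, which is exactly the kind of fast path that meta-tracing (RPython) or partial evaluation (Truffle/Graal) can then compile to efficient native code.

    // Minimal sketch of a self-optimizing AST interpreter (plain Java, illustrative names only).
    abstract class Node {
        Node parent;
        abstract Object execute();

        // Node rewriting: the central mechanism of a self-optimizing interpreter.
        Node replaceWith(Node newNode) {
            newNode.parent = parent;
            if (parent != null) { parent.replaceChild(this, newNode); }
            return newNode;
        }
        void replaceChild(Node oldChild, Node newChild) { /* overridden by nodes with children */ }
    }

    class Literal extends Node {
        final Object value;
        Literal(Object value) { this.value = value; }
        Object execute() { return value; }
    }

    // Generic addition: handles arbitrary operand types, but is slow.
    class AddNode extends Node {
        Node left, right;
        AddNode(Node left, Node right) {
            this.left = left; this.right = right;
            left.parent = this; right.parent = this;
        }
        void replaceChild(Node oldChild, Node newChild) {
            if (left == oldChild) { left = newChild; } else if (right == oldChild) { right = newChild; }
        }
        Object execute() {
            Object l = left.execute();
            Object r = right.execute();
            if (l instanceof Integer && r instanceof Integer) {
                // Speculate that this node will keep seeing integers:
                // rewrite to the specialized node for all future executions.
                replaceWith(new IntAddNode(left, right));
                return (Integer) l + (Integer) r;
            }
            return String.valueOf(l) + r;  // stand-in for a fully generic 'add'
        }
    }

    // Specialized addition: the fast path a meta-compiler can turn into a few machine instructions.
    class IntAddNode extends AddNode {
        IntAddNode(Node left, Node right) { super(left, right); }
        Object execute() {
            Object l = left.execute();
            Object r = right.execute();
            if (l instanceof Integer && r instanceof Integer) {
                return (Integer) l + (Integer) r;
            }
            // Speculation failed: fall back to the generic node for future executions.
            replaceWith(new AddNode(left, right));
            return String.valueOf(l) + r;
        }
    }

    public class SelfOptimizingDemo {
        public static void main(String[] args) {
            Node inner = new AddNode(new Literal(1), new Literal(2));
            Node root  = new AddNode(inner, new Literal(4));
            System.out.println(root.execute());  // prints 7; the inner AddNode rewrites itself into an IntAddNode
            System.out.println(root.execute());  // prints 7 again, with the inner addition now on the fast path
        }
    }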

Slides

Partitioned Global Address Space Languages

More than a decade ago, programmer productivity was identified as one of the main hurdles for future parallel systems. The so-called Partitioned Global Address Space (PGAS) languages try to improve productivity and explore a range of language design ideas. These PGAS languages are designed for large-scale high-performance parallel programming and provide a globally shared address space, while exposing explicit locality at the language level. Even though the main focus is high-performance computing, the language ideas are also relevant for the parallel and concurrent programming world in general.
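To make the core idea a bit more tangible, here is a small, hypothetical sketch in Java; it is not the API of any actual PGAS language, and all names are made up for illustration. A globally addressable array is partitioned block-wise over a number of ‘places’, every element can be read and written from anywhere, and the program can ask whether an index is local, because local and remote accesses have very different costs in a real PGAS system.

    // Hypothetical sketch of the PGAS idea: one global index space,
    // partitioned into 'places', with locality made explicit.
    class PartitionedArray {
        private final int blockSize;
        private final long[][] partitions;  // partitions[p] conceptually lives in place p

        PartitionedArray(int size, int numPlaces) {
            this.blockSize = (size + numPlaces - 1) / numPlaces;
            this.partitions = new long[numPlaces][blockSize];
        }

        // Block distribution: which place owns a given global index?
        int placeOf(int globalIndex) { return globalIndex / blockSize; }

        boolean isLocal(int globalIndex, int myPlace) { return placeOf(globalIndex) == myPlace; }

        // Globally addressable: any place may access any index, but in a real
        // PGAS implementation a remote access would involve communication.
        long get(int globalIndex) { return partitions[placeOf(globalIndex)][globalIndex % blockSize]; }

        void set(int globalIndex, long value) { partitions[placeOf(globalIndex)][globalIndex % blockSize] = value; }
    }

    public class PgasSketch {
        public static void main(String[] args) {
            PartitionedArray a = new PartitionedArray(16, 4);
            int myPlace = 0;
            for (int i = 0; i < 16; i++) {
                // A PGAS program prefers to work on the indices its own place holds...
                if (a.isLocal(i, myPlace)) { a.set(i, (long) i * i); }
            }
            // ...but it can still address the whole array when it has to.
            System.out.println(a.get(3) + " " + a.get(12));
        }
    }

How parallelism is introduced, how the address space is partitioned, how data is distributed among the partitions (block-wise above), and how data is accessed across partitions are exactly the four axes along which the survey below classifies the existing languages.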

As part of our research in the field of parallelism and concurrency, we studied the PGAS languages more closely to get a better understanding of the specific concepts they explore and to get a feeling for the tradeoffs of the various language design options. The result is a survey of the major PGAS languages, which was very recently accepted for publication in ACM Computing Surveys.

The preprint of the paper is available as PDF and HTML. The final edited version will probably take another eternity to appear, but oh well, that’s academia.

Abstract

The Partitioned Global Address Space (PGAS) model is a parallel programming model that aims to improve programmer productivity while at the same time aiming for high performance. The main premise of PGAS is that a globally shared address space improves productivity, but that a distinction between local and remote data accesses is required to allow performance optimizations and to support scalability on large-scale parallel architectures. To this end, PGAS preserves the global address space while embracing awareness of non-uniform communication costs.

Today, about a dozen languages exist that adhere to the PGAS model. This survey proposes a definition and a taxonomy along four axes: how parallelism is introduced, how the address space is partitioned, how data is distributed among the partitions and finally how data is accessed across partitions. Our taxonomy reveals that today’s PGAS languages focus on distributing regular data and distinguish only between local and remote data access cost, whereas the distribution of irregular data and the adoption of richer data access cost models remain open challenges.

  • Partitioned Global Address Space Languages; Mattias De Wael, Stefan Marr, Bruno De Fraine, Tom Van Cutsem, Wolfgang De Meuter; ACM Computing Surveys, to appear.
  • Paper: PDF, HTML
  • DOI: TBA
  • BibTex: BibSonomy

SOM Performance Numbers

Today, I got a few more benchmarks running to get a better idea of where RTruffleSOM and TruffleSOM stand in terms of their absolute performance.

Measuring performance is always a tedious exercise, especially when trying to compare something to established systems.

So, what do we compare a simple Smalltalk like SOM to? Let’s go with Java on HotSpot, which is widely regarded as ‘the’ state-of-the-art system when it comes to dynamic compilation. Of course, this means cross-language benchmarking. To minimize the apples-and-oranges aspect of such an undertaking, I transcribed most of the SOM benchmarks we have to Java 8. That means I am using lambdas wherever they seem a good and somewhat idiomatic choice. Counting loops are just plain for loops, of course, but iterations over collections can be nicely expressed with Java’s lambdas. The first interesting insight was that two of the larger benchmarks, namely Richards and DeltaBlue, actually got a little faster than their old Java versions. Even though the new versions use getters/setters and lambdas everywhere, performance did not drop. I have to admit, I kind of hoped that HotSpot would struggle just a little bit, to make SOM look a little better. But no, unfortunately, ‘good OO code’ seems to be something that HotSpot can appreciate.
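To give a rough idea of what ‘somewhat idiomatic Java 8’ means here, below is a made-up snippet in the spirit of the transcriptions; it is not taken from the actual benchmarks. Counting loops stay plain for loops, iteration over a collection uses a lambda, and fields are accessed through getters and setters, as in the ‘good OO code’ versions mentioned above.

    import java.util.ArrayList;
    import java.util.List;

    public class TranscriptionStyle {
        // Plain field with getter/setter, as in the 'good OO code' versions.
        private int sum;
        public int getSum() { return sum; }
        public void setSum(int sum) { this.sum = sum; }

        // A counting loop stays a plain for loop.
        public int sumUpTo(int n) {
            int result = 0;
            for (int i = 1; i <= n; i++) {
                result += i;
            }
            return result;
        }

        // Iteration over a collection is expressed with a lambda.
        public int sumList(List<Integer> values) {
            setSum(0);
            values.forEach(v -> setSum(getSum() + v));
            return getSum();
        }

        public static void main(String[] args) {
            TranscriptionStyle t = new TranscriptionStyle();
            List<Integer> values = new ArrayList<>();
            for (int i = 1; i <= 100; i++) { values.add(i); }
            System.out.println(t.sumUpTo(100) + " " + t.sumList(values));  // prints "5050 5050"
        }
    }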

Enough introduction, let’s look at the numbers:

Performance results for SOM, normalized to the results for Java 8 with HotSpot in server mode. Lower is better.

RTruffleSOM is the SOM version based on RPython with a meta-tracing just-in-time compiler. On average, RTruffleSOM is 4.7x slower than Java 8 (min. 1.8x, max. 10.7x). So, there is still quite a bit of room for improvement. TruffleSOM is SOM implemented with Oracle Labs’ Truffle framework, running on top of a JVM with the Graal JIT compiler. It is about 2.7x slower than Java 8 (min. 3%, max. 4.7x). The Mandelbrot benchmark reaches Java’s performance, while some of the others are still quite a bit slower. Overall, however, for a ‘simple’ interpreter, I think the 3x-slower-than-Java range is pretty good.

Of course, this is the performance after warmup and over multiple iterations. The plot is a box plot, and as you can see, the results are reasonably stable after warmup. Unfortunately, there are applications out there that might not run hot code all the time. So, one more question would be, how good are the interpreters?

So, let’s compare to the Java interpreter, which is used by giving Java the -Xint command-line option:

Performance results for SOM, normalized to the results for Java 8 with HotSpot in interpreter mode. Lower is better. The results here are not comparable to the previous results; for the moment, the benchmarks use different parameters to avoid overly long runtimes.

For RTruffleSOM, we see a 6x slowdown compared to the Java 8 interpreter (min. 1.7x, max. 15.8x). The TruffleSOM interpreter is slightly faster, showing only a slowdown of 5.6x (min. 1.9x, max. 13.5x). However, we run TruffleSOM on top of a JVM, so it still benefits from HotSpot’s just-in-time compiler. I also need to point out that both SOMs are run without their respective JIT-compilation frameworks built in. For RTruffleSOM, this means we use a binary without meta-tracing support, and TruffleSOM runs on top of HotSpot without Graal. This means these numbers are best-case interpreter numbers. TruffleSOM in particular is much slower when interpreted on top of Graal, since it records extensive profiling information to enable good JIT compilation, which leads to a slowdown of several factors in the worst case.

Overall, I am pretty happy with the performance of the little SOM interpreters. RTruffleSOM is roughly 5k lines of code and TruffleSOM about 10k LOC, and still both reach the same order of magnitude of performance as Java.

 

Are We There Yet? Simple Language-Implementation Techniques for the 21st Century

The first results of my experiments with self-optimizing interpreters were finally published in IEEE Software. It is a brief and very high-level comparison of the Truffle approach with a classic bytecode-based interpreter on top of RPython. If you aren’t familiar with either of these approaches, the article is hopefully a good starting point. The experiments described in it use SOM, a simple Smalltalk.
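For readers who have not seen the two styles side by side: the RPython-based SOM in the article is a classic bytecode interpreter, whereas TruffleSOM is a self-optimizing AST interpreter. The following is a tiny, made-up dispatch loop in Java, far simpler than the real SOM bytecode set, just to illustrate what the bytecode-interpreter style looks like conceptually.

    // A tiny, made-up bytecode interpreter for illustration only;
    // the actual SOM interpreters are considerably richer.
    public class MiniBytecodeInterp {
        static final byte PUSH = 0;  // push the next byte as a constant
        static final byte ADD  = 1;  // pop two values, push their sum
        static final byte HALT = 2;  // stop and return the top of the stack

        static int interpret(byte[] code) {
            int[] stack = new int[16];
            int sp = -1;  // stack pointer
            int ip = 0;   // instruction pointer
            while (true) {
                byte bytecode = code[ip++];
                switch (bytecode) {
                    case PUSH: stack[++sp] = code[ip++]; break;
                    case ADD:  { int r = stack[sp--]; stack[sp] += r; break; }
                    case HALT: return stack[sp];
                    default:   throw new IllegalStateException("unknown bytecode " + bytecode);
                }
            }
        }

        public static void main(String[] args) {
            // Computes 1 + 2.
            System.out.println(interpret(new byte[] {PUSH, 1, PUSH, 2, ADD, HALT}));
        }
    }

RPython’s meta-tracing JIT traces through such a dispatch loop at run time, while Truffle relies on partial evaluation of self-optimizing AST nodes; the article compares the consequences of the two approaches for the language implementer.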

Since writing things down, the work on the different SOM implementations has continued, resulting in better overall performance. This reminds me: thanks again to the communities around PyPy/RPython and Truffle/Graal for their continued support!

The preprint of the paper is available as PDF and HTML. For the experiments, we also prepared an online appendix with a few more details and made the experimental setup available on GitHub.

Abstract

With the rise of domain-specific languages (DSLs), research in language implementation techniques regains importance. While DSLs can help to manage the domain’s complexity, it is rarely affordable to build highly optimizing compilers or virtual machines, and thus, performance remains an issue. Ideally, one would implement a simple interpreter and still reach acceptable performance levels. RPython and Truffle are two approaches that promise to facilitate language implementation based on simple interpreters, while reaching performance of the same order of magnitude as highly optimizing virtual machines. In this case study, we compare the two approaches to identify commonalities, weaknesses, and areas for further research to improve their utility for language implementations.

  • Are We There Yet? Simple Language Implementation Techniques for the 21st Century; Stefan Marr, Tobias Pape, Wolfgang De Meuter; IEEE Software 31, no. 5, pp. 60-67.
  • Paper: PDF, HTML, online appendix
  • DOI: 10.1109/MS.2014.98
  • BibTex: BibSonomy

Fork/Join Parallelism in the Wild: Documenting Patterns and Anti-Patterns in Java Programs using the Fork/Join Framework

Parallel programming is frequently claimed to be hard, and all kinds of approaches have been proposed to deal with its complexity. The Fork/Join programming style introduced with Cilk enables the parallel decomposition of problems in a recursive divide-and-conquer style, and on the surface it looks very simple with its minimalistic approach of having just a fork and a join language construct. But is it actually simple to use? To find out, Mattias started to dig through Java open source projects on GitHub and tried to identify common patterns. Next week, he will present our findings at PPPJ’14.
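For readers unfamiliar with the framework, here is a small, self-contained example of the programming style the paper studies, written against the standard java.util.concurrent Fork/Join API. The task and the cutoff value are made up, but it uses two of the best-practice patterns discussed in the paper: a sequential cutoff, and avoiding an unnecessary fork by computing one half of the work directly.

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveTask;

    // Recursive divide-and-conquer summation using Java's Fork/Join framework.
    public class SumTask extends RecursiveTask<Long> {
        private static final int CUTOFF = 10_000;  // sequential cutoff: below this, don't fork

        private final long[] values;
        private final int from, to;

        SumTask(long[] values, int from, int to) {
            this.values = values;
            this.from = from;
            this.to = to;
        }

        @Override
        protected Long compute() {
            if (to - from <= CUTOFF) {
                // Small enough: compute sequentially instead of forking further.
                long sum = 0;
                for (int i = from; i < to; i++) { sum += values[i]; }
                return sum;
            }
            int mid = (from + to) / 2;
            SumTask left = new SumTask(values, from, mid);
            SumTask right = new SumTask(values, mid, to);
            left.fork();                     // run the left half asynchronously
            long rightSum = right.compute(); // compute the right half in this worker, avoiding an unnecessary fork
            return left.join() + rightSum;   // wait for the left half and combine the results
        }

        public static void main(String[] args) {
            long[] values = new long[1_000_000];
            for (int i = 0; i < values.length; i++) { values[i] = i; }
            long sum = new ForkJoinPool().invoke(new SumTask(values, 0, values.length));
            System.out.println(sum);
        }
    }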

The preprint of the paper is available below. Additionally, Mattias made the information on the corpus and how to obtain it available.

Abstract

Now that multicore processors are commonplace, developing parallel software has escaped the confines of high-performance computing and is entering the mainstream. The Fork/Join framework, for instance, has been part of the standard Java platform since version 7. Fork/Join is a high-level parallel programming model advocated to make parallelizing recursive divide-and-conquer algorithms particularly easy. While, in theory, Fork/Join is a simple and effective technique to expose parallelism in applications, it has not been investigated before whether and how the technique is applied in practice. We therefore performed an empirical study on a corpus of 120 open source Java projects that use the framework for roughly 362 different tasks.

On the one hand, we confirm the frequent use of four best-practice patterns (Sequential Cutoff, Linked Subtasks, Leaf Tasks, and avoiding unnecessary forking) in actual projects. On the other hand, we also discovered three recurring anti-patterns that potentially limit parallel performance: sub-optimal use of Java collections when splitting tasks into subtasks as well as when merging the results of subtasks, and finally the inappropriate sharing of resources between tasks. We document these anti-patterns and study their impact on performance.

  • Fork/Join Parallelism in the Wild: Documenting Patterns and Anti-Patterns in Java Programs using the Fork/Join Framework; Mattias De Wael, Stefan Marr, Tom Van Cutsem; in ‘Proceedings of the 2014 International Conference on Principles and Practices of Programming on the Java Platform: Virtual Machines, Languages, and Tools’ , pp. 39-50.
  • Paper: PDF
  • BibTex: BibSonomy
  • Corpus and additional material: online appendix