Why Is Concurrent Programming Hard? And What Can We Do about It? #vmm2015

Yesterday at the Virtual Machine Meetup, I gave a talk about why I think concurrent programming is hard, and what we can do about it.

The talk is very much related to an earlier blog post with the same title. My main point is that concurrent programming is hard because, on the one hand, there is no single concurrency abstraction that fits all problems, and, on the other hand, the various abstractions are rarely designed to be used in combination with each other. First results in that direction make us hopeful that we can actually adapt interesting sets of concurrency abstractions to work together without causing combination issues such as unexpected deadlocks or data races.

The blog post and the slide set below hopefully give a little more insight into what we have in mind.

Zero-Overhead Metaprogramming

Runtime metaprogramming and reflection are slow. That’s common wisdom. Unfortunately, it is also true: using reflection, for instance with Java’s reflection API, its dynamic proxies, Ruby’s #send or #method_missing, PHP’s magic methods such as __call, Python’s __getattr__, C#’s DynamicObjects, or really any metaprogramming abstraction in modern languages, comes at a price. Few language implementations optimize these operations. For instance, on Java’s HotSpot VM, reflective method invocation and dynamic proxies have an overhead of 6-7x compared to direct operations.
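
To make the cost concrete, here is a small, self-contained Java example of the kind of operation discussed here (my own illustration, not code from the paper): the same method is called once directly and once through java.lang.reflect. The reflective path goes through the generic Method.invoke, with boxed arguments and access checks, which is where the kind of overhead mentioned above comes from.

```java
import java.lang.reflect.Method;

public class ReflectionCost {
  public int inc(int x) { return x + 1; }

  public static void main(String[] args) throws Exception {
    ReflectionCost r = new ReflectionCost();

    // direct call: the JIT compiler can inline and optimize this
    int direct = r.inc(41);

    // reflective call: generic Method.invoke() with boxed arguments and
    // access checks; typically several times slower than the direct call
    Method m = ReflectionCost.class.getMethod("inc", int.class);
    int reflective = (Integer) m.invoke(r, 41);

    System.out.println(direct + " " + reflective);
  }
}
```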

But does it have to be that way? No, it doesn’t!

And actually, a solution is rather simple. In a paper that Chris Seaton, Stéphane Ducasse, and I worked on, and which recently got accepted for presentation at the PLDI conference, we show that a simple generalization of polymorphic inline caches can be used to optimize metaprogramming so that it doesn’t have any performance cost after just-in-time compilation.
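
To give a rough idea of what that generalization looks like, below is a hedged Java sketch of a dispatch chain for a reflective call site in the style of Ruby’s #send. The names (LookedUpMethod, CachedDispatch, the method table) are placeholders of my own and not the paper’s code; the point is only that each node in the chain caches the lookup result for one observed selector, so that after a few executions the call site dispatches without any lookup, and a just-in-time compiler can treat the cached target as a constant, exactly as a polymorphic inline cache does for receiver classes.

```java
import java.util.Map;

// Stand-in for the interpreter's internal method representation.
interface LookedUpMethod {
  Object invoke(Object receiver, Object[] args);
}

abstract class DispatchNode {
  abstract Object dispatch(Object receiver, String selector, Object[] args);
}

// One cache entry: specialized for a single selector value.
final class CachedDispatch extends DispatchNode {
  private final String cachedSelector;
  private final LookedUpMethod target;   // cached lookup result
  private final DispatchNode next;       // rest of the chain

  CachedDispatch(String selector, LookedUpMethod target, DispatchNode next) {
    this.cachedSelector = selector;
    this.target = target;
    this.next = next;
  }

  @Override
  Object dispatch(Object receiver, String selector, Object[] args) {
    if (selector.equals(cachedSelector)) {
      return target.invoke(receiver, args);          // hit: no lookup needed
    }
    return next.dispatch(receiver, selector, args);  // miss: try the next entry
  }
}

// End of the chain: performs the actual lookup on the slow path.
final class UninitializedDispatch extends DispatchNode {
  private final Map<String, LookedUpMethod> methodTable;  // placeholder for real lookup

  UninitializedDispatch(Map<String, LookedUpMethod> methodTable) {
    this.methodTable = methodTable;
  }

  @Override
  Object dispatch(Object receiver, String selector, Object[] args) {
    LookedUpMethod m = methodTable.get(selector);
    // a real implementation would rewrite the call site here to
    // new CachedDispatch(selector, m, this), so later calls hit the cache
    return m.invoke(receiver, args);
  }
}
```

The same chain structure works for any value one wants to specialize on, for instance the metaobject in a metaobject protocol, which is what makes the generalization useful beyond plain method dispatch.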

You might wonder, why do we care? Since it is slow, people don’t use it in performance-sensitive code, right? Well, as it turns out, in Ruby it is used everywhere, because it is convenient and allows for straightforward and general solutions. So, making metaprogramming fast will benefit many applications. But that’s not all. For my own research on concurrency, I proposed the ownership-based metaobject protocol (OMOP) as a foundation for implementing a wide range of different concurrent programming abstractions. Unfortunately, such metaobject protocols have been inherently difficult to optimize. Instead of finding a solution, researchers gave up on them and focused on designing aspect-oriented programming languages, which sidestep the performance issues by applying only to a minimal set of program points instead of pervasively throughout the whole program. For my use case, that wasn’t good enough. Now, however, by generalizing polymorphic inline caches, we have solved the performance issues of metaobject protocols as well.
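
For readers unfamiliar with the OMOP, the following is a very rough Java sketch of the underlying idea; the interface and class names are invented for illustration and are not the actual protocol. Every object is owned by a domain, and the domain intercepts state access on the objects it owns, which is enough to express policies such as immutability or isolation.

```java
// Invented names; only meant to illustrate the ownership idea behind the OMOP.
interface Domain {
  Object readField(Object owner, String fieldName, Object currentValue);
  Object writeField(Object owner, String fieldName, Object newValue);
}

// A domain that permits everything: plain, unsynchronized object behavior.
final class UnrestrictedDomain implements Domain {
  @Override public Object readField(Object owner, String field, Object current) {
    return current;
  }
  @Override public Object writeField(Object owner, String field, Object newValue) {
    return newValue;
  }
}

// A domain that enforces immutability, as one example of a concurrency policy.
final class ImmutableDomain implements Domain {
  @Override public Object readField(Object owner, String field, Object current) {
    return current;                        // reads are always allowed
  }
  @Override public Object writeField(Object owner, String field, Object newValue) {
    throw new IllegalStateException("object is owned by an immutable domain");
  }
}

// An object whose state access is routed through the domain that owns it.
final class ManagedCounter {
  private final Domain owner;
  private Object count = 0L;

  ManagedCounter(Domain owner) { this.owner = owner; }

  long get()              { return (Long) owner.readField(this, "count", count); }
  void set(long newValue) { count = owner.writeField(this, "count", newValue); }
}
```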

The abstract of the paper and the PDF/HTML versions, as well as the artifact with all experiments are linked below.


Runtime metaprogramming enables many useful applications and is often a convenient solution to solve problems in a generic way, which makes it widely used in frameworks, middleware, and domain-specific languages. However, powerful metaobject protocols are rarely supported and even common concepts such as reflective method invocation or dynamic proxies are not optimized. Solutions proposed in literature either restrict the metaprogramming capabilities or require application or library developers to apply performance improving techniques.

For overhead-free runtime metaprogramming, we demonstrate that dispatch chains, a generalized form of polymorphic inline caches common to self-optimizing interpreters, are a simple optimization at the language-implementation level. Our evaluation with self-optimizing interpreters shows that unrestricted metaobject protocols can be realized for the first time without runtime overhead, and that this optimization is applicable for just-in-time compilation of interpreters based on meta-tracing as well as partial evaluation. In this context, we also demonstrate that optimizing common reflective operations can lead to significant performance improvements for existing applications.

  • Zero-Overhead Metaprogramming: Reflection and Metaobject Protocols Fast and without Compromises; Stefan Marr, Chris Seaton, Stéphane Ducasse; in ‘Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation’ (PLDI ’15).
  • Paper: PDF, HTML
  • BibTex: BibSonomy
  • DOI: 10.1145/2737924.2737963
  • Online Appendix: artifacts and experimental setup


FOSDEM 2015: Building High-Performance Language Implementations With Low Effort

Today, I gave a talk on the main track at FOSDEM about implementing languages based on the ideas behind RPython and Truffle. Please find the abstract and slides below.

Thanks again also to the FOSDEM organizers for having me! :)


This talk shows how languages can be implemented as self-optimizing interpreters, and how Truffle and RPython go about just-in-time compiling these interpreters to efficient native code.

Programming languages are never perfect, so people start building domain-specific languages to be able to solve their problems more easily. However, custom languages are often slow, or take enormous amounts of effort to be made fast by building custom compilers or virtual machines.

With the notion of self-optimizing interpreters, researchers proposed a way to implement languages easily and generate a JIT compiler from a simple interpreter. We explore the idea and experiment with it on top of RPython (of PyPy fame) with its meta-tracing JIT compiler, as well as Truffle, the JVM framework of Oracle Labs for self-optimizing interpreters.
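
To make the idea a little more tangible, here is a minimal Java sketch of a self-optimizing AST node, deliberately written without the actual Truffle or RPython APIs: an addition node observes its operands on first execution, specializes to integer addition, and falls back to a generic case if that speculation ever fails. The frameworks then generate machine code only for the specialized fast paths.

```java
// Minimal self-optimizing interpreter sketch; no Truffle/RPython API involved.
abstract class ExprNode {
  abstract Object execute();
}

final class LiteralNode extends ExprNode {
  private final Object value;
  LiteralNode(Object value) { this.value = value; }
  @Override Object execute() { return value; }
}

abstract class AddNode extends ExprNode {
  protected final ExprNode left, right;
  AddNode(ExprNode left, ExprNode right) { this.left = left; this.right = right; }

  static Object genericAdd(Object l, Object r) {
    // fully generic fallback; string concatenation stands in for the slow path
    return l.toString() + r.toString();
  }
}

// The specialized node an uninitialized node would rewrite itself to.
final class IntAddNode extends AddNode {
  IntAddNode(ExprNode left, ExprNode right) { super(left, right); }
  @Override Object execute() {
    Object l = left.execute(), r = right.execute();
    if (l instanceof Long && r instanceof Long) {
      return (Long) l + (Long) r;   // fast path: plain integer addition
    }
    return genericAdd(l, r);        // speculation failed: take the generic path
  }
}

final class UninitializedAddNode extends AddNode {
  UninitializedAddNode(ExprNode left, ExprNode right) { super(left, right); }
  @Override Object execute() {
    Object l = left.execute(), r = right.execute();
    // On the first execution, a real framework would replace this node in the
    // AST with an IntAddNode (or a generic node), so that the JIT compiler
    // only ever sees the specialized fast path.
    if (l instanceof Long && r instanceof Long) {
      return (Long) l + (Long) r;
    }
    return genericAdd(l, r);
  }
}

public class SelfOptSketch {
  public static void main(String[] args) {
    ExprNode expr = new UninitializedAddNode(new LiteralNode(20L), new LiteralNode(22L));
    System.out.println(expr.execute());  // prints 42
  }
}
```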

In this talk, we show how a simple interpreter can reach the same order of magnitude of performance as the highly optimizing JVM for Java. We discuss the implementation on top of RPython as well as on top of Java with Truffle so that you can start right away, independent of whether you prefer the Python or JVM ecosystem.

While our own experiments focus on SOM, a little Smalltalk variant to keep things simple, other people have used this approach to improve the peak performance of JRuby, or to build languages such as JavaScript, R, and Python 3.


Partitioned Global Address Space Languages

More than a decade ago, programmer productivity was identified as one of the main hurdles for future parallel systems. The so-called Partitioned Global Address Space (PGAS) languages try to improve productivity and explore a range of language design ideas. These PGAS languages are designed for large-scale high-performance parallel programming and provide the notion of a globally shared address space, while exposing the notion of explicit locality at the language level. Even though the main focus is high-performance computing, the language ideas are also relevant for the parallel and concurrent programming world in general.
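
As a rough illustration of the PGAS model (my own sketch in Java, not code from any of the surveyed languages), the following partitioned array is globally indexable, but its API makes the difference between cheap local accesses and costly remote accesses explicit; PGAS languages expose exactly this distinction at the language level.

```java
// Hypothetical sketch of a partitioned global array; a real PGAS runtime would
// back remote accesses with communication between nodes ("places").
final class PartitionedArray {
  private final long[][] partitions;  // one chunk per place
  private final int chunkSize;
  private final int myPlace;          // the place this process runs on

  PartitionedArray(int places, int chunkSize, int myPlace) {
    this.partitions = new long[places][chunkSize];
    this.chunkSize = chunkSize;
    this.myPlace = myPlace;
  }

  int placeOf(int globalIndex) { return globalIndex / chunkSize; }

  boolean isLocal(int globalIndex) { return placeOf(globalIndex) == myPlace; }

  long get(int globalIndex) {
    if (isLocal(globalIndex)) {
      // cheap: a plain memory access within the local partition
      return partitions[myPlace][globalIndex % chunkSize];
    }
    // costly: in a real system this would be a network round-trip
    return remoteGet(placeOf(globalIndex), globalIndex % chunkSize);
  }

  private long remoteGet(int place, int offset) {
    return partitions[place][offset];  // stand-in for remote communication
  }
}
```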

As part of our research in the field of parallelism and concurrency, we studied the PGAS languages more closely to get a better understanding of the specific concepts they explore and to get a feeling for the tradeoffs of the various language design options. The result is a survey of the major PGAS languages, which was very recently accepted for publication in the ACM Computing Surveys.

The preprint of the paper is available as PDF and HTML version. The final edited version will probably take another eternity to appear, but oh well, that’s academia.


The Partitioned Global Address Space (PGAS) model is a parallel programming model that aims to improve programmer productivity while at the same time aiming for high performance. The main premise of PGAS is that a globally shared address space improves productivity, but that a distinction between local and remote data accesses is required to allow performance optimizations and to support scalability on large-scale parallel architectures. To this end, PGAS preserves the global address space while embracing awareness of non-uniform communication costs.

Today, about a dozen languages exist that adhere to the PGAS model. This survey proposes a definition and a taxonomy along four axes: how parallelism is introduced, how the address space is partitioned, how data is distributed among the partitions and finally how data is accessed across partitions. Our taxonomy reveals that today’s PGAS languages focus on distributing regular data and distinguish only between local and remote data access cost, whereas the distribution of irregular data and the adoption of richer data access cost models remain open challenges.

  • Partitioned Global Address Space Languages; Mattias De Wael, Stefan Marr, Bruno De Fraine, Tom Van Cutsem, Wolfgang De Meuter; ACM Computing Surveys, to appear.
  • Paper: PDF, HTML
  • DOI: TBA
  • BibTex: BibSonomy

SOM Performance Numbers

Today, I got a few more benchmarks running to get a better idea of where RTruffleSOM and TruffleSOM stand in terms of their absolute performance.

Measuring performance is always a tedious exercise, especially when trying to compare something to established systems.

So, what do we compare a simple Smalltalk like SOM to? Let’s go with Java and HotSpot, which is widely regarded as ‘the’ state-of-the-art system when it comes to dynamic compilation. Of course, this means cross-language benchmarking. To minimize the apples-and-oranges aspect of such an undertaking, I transcribed most of the SOM benchmarks we have to Java 8. That means I am using lambdas where that seems a good and somewhat idiomatic choice. Counting loops are just plain for loops of course, but iterations over collections can be nicely expressed with Java’s lambdas. The first interesting insight was that the new versions of two of the larger benchmarks, namely Richards and DeltaBlue, actually got a little faster than the old Java versions. Even though the new versions use getters/setters and lambdas everywhere, performance did not drop. I have to admit, I kind of hoped that HotSpot would struggle just a little bit, to make SOM look a little better. But no, unfortunately, ‘good OO code’ seems to be something that HotSpot can appreciate.
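
As an illustration of that transcription style (not the actual benchmark code), the sketch below keeps a counting loop as a plain for loop and expresses iteration over a collection with Java 8 lambdas and streams.

```java
import java.util.Arrays;
import java.util.List;

public class TranscriptionStyle {
  public static void main(String[] args) {
    // counting loop: stays a plain for loop, as in the SOM version
    long sum = 0;
    for (int i = 1; i <= 1_000; i++) {
      sum += i;
    }

    // collection iteration: expressed with lambdas, roughly mirroring
    // Smalltalk-style #do: and #inject:into: blocks
    List<Integer> values = Arrays.asList(3, 1, 4, 1, 5, 9);
    long total = values.stream().mapToLong(Integer::longValue).sum();
    values.forEach(v -> System.out.println("value: " + v));

    System.out.println(sum + " " + total);
  }
}
```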

Enough introduction, let’s look at the numbers:

Performance results for SOM, normalized to the results for Java 8 with HotSpot in server mode. Lower is better.

RTruffleSOM is the SOM version based on RPython with a meta-tracing just-in-time compiler. On average, RTruffleSOM is 4.7x slower than Java 8 (min. 1.8x, max. 10.7x). So, there is still quite a bit of room for improvement. TruffleSOM is SOM implemented using Oracle Labs’ Truffle framework, running on top of a JVM with the Graal JIT compiler. It is about 2.7x slower than Java 8 (min. 3%, max. 4.7x). The Mandelbrot benchmark reaches Java’s performance, while some of the others are still quite a bit slower. Overall, however, for a ‘simple’ interpreter, I think the 3x-slower-than-Java range is pretty good.

Of course, this is the performance after warmup and over multiple iterations. The plot is a box plot, and as you can see, the results are reasonably stable after warmup. Unfortunately, there are applications out there that might not run hot code all the time. So, one more question would be, how good are the interpreters?

So, let’s compare to the Java interpreter, which is used by giving Java the -Xint command-line option:

Performance results for SOM, normalized to the results for Java 8 with HotSpot in interpreter mode. Lower is better. The results here are not comparable to the previous results. For the moment, the benchmarks use different parameters to avoid overly long runtimes.

For RTruffleSOM, we see a 6x slowdown compared to the Java 8 interpreter (min. 1.7x, max. 15.8x). The TruffleSOM interpreter is slightly faster, showing a slowdown of only 5.6x (min. 1.9x, max. 13.5x). However, we run TruffleSOM on top of a JVM, so it still benefits from HotSpot’s just-in-time compiler. I also need to point out that both SOMs are run without their respective JIT-compilation frameworks built in. For RTruffleSOM, this means we use a binary without meta-tracing support, and TruffleSOM runs on top of HotSpot without Graal. This means these numbers are best-case interpreter numbers. TruffleSOM in particular is much slower in interpreter mode on top of Graal, since it records extensive profiling information to enable good JIT compilation, which leads to a slowdown of several factors in the worst case.

Overall, I am pretty happy with the performance of the little SOM interpreters. RTruffleSOM is roughly 5k lines of code and TruffleSOM about 10k LOC, and still, both reach the same order of magnitude of performance as Java.