This post is a brief overview of our new study of abstract-syntax-tree and bytecode interpreters on top of RPython and the GraalVM metacompilation systems, which we are presenting next week at OOPSLA.

Prelude: Why Did I Build a Language on Top of the Truffle and RPython Metacompilation Systems?

Writing this post, I realized that I have been working on interpreters on top of metacompilation systems for 10 years now. And in much of that time, it felt to me that widely held beliefs about interpreters did not quite match my own experience.

In summer 2013, I started to explore new directions for my research after finishing my PhD just that January. The one question that was still bugging me back then was how I could show that my PhD idea of using metaobject protocols to realize different concurrency models was not just possible, but perhaps even practical.

During my PhD, I worked with a bytecode interpreter written in C++, and now I had the choice of spending the next few years writing a just-in-time compiler for it, which was never going to be able to keep up with what state-of-the-art VMs offered, or finding another way to get good performance.

To my surprise, there was another way. Though even today, some say that metacompilation is ludicrous and will never be practical. And others simply use PyPy as their secret sauce and enjoy better Python performance…

Though, I didn’t start with PyPy, or rather its RPython metacompilation framework. Instead, I found the Truffle framework with its Graal just-in-time compiler and implemented the Simple Object Machine (SOM), a dynamic language I had been using for a while already, on top of it. Back then, Truffle and Graal were very new, and it took a while for my TruffleSOM to reach the desired “state-of-the-art performance”, but I got there eventually. On the way to reaching that point, I also implemented PySOM on top of PyPy’s RPython framework, which was my way of getting a better understanding of metacompilation more generally. But enough of history…

The Promise of Metacompilation: State-of-the-Art Performance for Little Effort

If you’re Google, you can afford to finance your own JavaScript virtual machine with a state-of-the-art interpreter, state-of-the-art garbage collector, and no less than three different just-in-time compilers, which of course are also “state of the art”. Your average academic, and even large language communities such as those around Ruby, Python, and PHP, do not have the resources to build such VMs. (Of course, reality is more complicated, but I’ll skip over it.)

This is where metacompilation comes in and promises us that we can reuse existing compilers, garbage collectors, and all the other parts of an existing high-level language VM. All we need to do is implement our language as an interpreter on top of something like the GraalVM or RPython systems.

That’s the promise. And, for a certain set of benchmarks, and a certain set of use cases, these systems deliver exactly that: reuse of these components. We still have to implement an interpreter suitable for these systems though. While this is no small feat, it’s something “an average” academic can do with enough time and stubbornness, and there are plenty of examples using Truffle and RPython.

And indeed, my SOM implementation manages to hold its own compared to Google’s V8:

Figure 1. Just-in-time-compiled peak performance of the Are We Fast Yet benchmarks, shown as an aggregate over all benchmarks on a logarithmic scale, with Java 17 (HotSpot) as the baseline. TSOM_AST reaches the performance of Node.js, while the peak performance of the other implementations is a little further behind.

As we can see here, both V8 inside of Node.js and TSOM_AST, which is short for the abstract-syntax-tree-based TruffleSOM interpreter, are roughly in the same range of being 1.7× to 2.7× slower than the HotSpot JVM. Since SOM is a dynamic language similar to JavaScript, and its performance is easily within ±50% of V8’s, I’d argue that metacompilation indeed lives up to its promise.

Of course, the Are We Fast Yet benchmarks we used test only a relatively small common part of these languages, but they show that the core language elements common to Java, JavaScript, and other object-oriented languages reach roughly the same level of performance.

Interpreters for Metacompilation Systems

However, we wanted to talk about abstract-syntax-tree (AST) and bytecode interpreters. The goal of our work was to investigate the difference between these two approaches to building interpreters on top of GraalVM and RPython.

To this end, we had to implement no less than four interpreters: two AST interpreters and two bytecode interpreters, one of each on the two systems. We also had to optimize them roughly to the same level, so that we could draw meaningful conclusions from the comparison. While AST and bytecode interpreters naturally lend themselves to different optimizations, we didn’t stop at what’s commonly done. Instead, we implemented the classic optimizations to gain the key performance benefits, and once we hit diminishing returns, we added the optimizations from the AST interpreters to the bytecode ones or the other way around, so that both roughly implement the same set of optimizations. Currently, this includes (see Section 4 of the paper for more details):

  • polymorphic lookup/inline caching
  • inlining of control structures and anonymous functions in specific situations
  • superinstructions for bytecode interpreters and supernodes for AST interpreters
  • bytecode quickening and self-optimizing AST nodes (see the sketch after this list)
  • lowering/intrinsifying of basic standard library methods
  • caching of globals
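
To give a flavor of these optimizations, below is a minimal, plain-Python sketch of a self-optimizing AST node in the spirit of what our interpreters do. All names (`Node`, `GenericAdd`, `IntAdd`, `Literal`) are made up for illustration and are not taken from TruffleSOM or PySOM; the real implementations also have to follow the Truffle and RPython APIs, which this sketch leaves out. Bytecode quickening is the analogous idea for bytecode interpreters: a generic bytecode overwrites itself in place with a specialized variant.

```python
class Node(object):
    """Minimal AST node with a parent pointer, so a child can replace itself."""
    def __init__(self):
        self.parent = None

    def adopt(self, child):
        child.parent = self
        return child

    def replace(self, replacement):
        # Node rewriting: swap this node for `replacement` in the parent.
        if self.parent is not None:
            for name, value in self.parent.__dict__.items():
                if value is self:
                    setattr(self.parent, name, replacement)
            replacement.parent = self.parent
        return replacement


class Literal(Node):
    def __init__(self, value):
        Node.__init__(self)
        self.value = value

    def execute(self, frame):
        return self.value


class GenericAdd(Node):
    """Unspecialized addition; rewrites itself based on the observed operands."""
    def __init__(self, left, right):
        Node.__init__(self)
        self.left = self.adopt(left)
        self.right = self.adopt(right)

    def execute(self, frame):
        left = self.left.execute(frame)
        right = self.right.execute(frame)
        if isinstance(left, int) and isinstance(right, int):
            # Specialize for the observed types, so later executions
            # (and the meta-compiler) only see the integer fast path.
            self.replace(IntAdd(self.left, self.right))
        return left + right


class IntAdd(GenericAdd):
    """Specialized addition; falls back to GenericAdd if its assumption breaks."""
    def execute(self, frame):
        left = self.left.execute(frame)
        right = self.right.execute(frame)
        if isinstance(left, int) and isinstance(right, int):
            return left + right
        self.replace(GenericAdd(self.left, self.right))
        return left + right
```

Several of the optimizations listed above, for instance inline caching and the lowering of library methods, build on the same basic idea: observe what actually happens at a particular AST node or bytecode index, and replace the generic implementation with one specialized to that observation.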

Some other classic interpreter optimizations were not directly possible on top of the metacompilation systems. This includes indirect threading and top-of-stack caching for bytecode interpreters. Well, more precisely, we experimented a little with both, but we only got them to work on a few benchmarks, and instead of giving the desired benefits, they rather slowed the interpreters down. As far as we can tell from talking to people working on GraalVM and RPython, pushing this further will require extensive changes to the metacompilation systems. So, future work…

With all these optimizations, we reached a point where adding further optimizations showed only minimal gains, typically specific to a benchmark, which gives us some confidence that our results are meaningful.

The final results are as follows:

Figure 2. Interpreter-only run-time performance of the Are We Fast Yet benchmarks, on a logarithmic scale, with Java 17 as baseline. While TSOM (TruffleSOM) and PySOM are overall slower than HotSpot's Java interpreter and Node.js/V8's Ignition interpreter, we can also observe that PySOM_AST and TSOM_AST are faster than the bytecode versions. TSOM_BC is the slowest interpreter overall.

Since we are confident that both types of interpreters are optimized roughly to the same level, we conclude that bytecode interpreters do not have their traditional advantage on top of metacompilation systems when it comes to pure interpreter speed.

Based on our current understanding, we attribute that to some general challenges metacompilation systems face when producing native code for the bytecode interpreter loops. GraalVM, for instance, uses the Graal IR, which only supports structured control flow. This means we cannot directly encode arbitrary jumps between bytecodes, and the compiler also struggles with the sheer size of bytecode loops, leaving some optimization opportunities on the table. The bytecode loop of a standard interpreter can compile to multiple tens or even hundreds of kilobytes of native code, whereas bytecode interpreters written in C/C++ are typically much more compact.
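
For illustration, here is a minimal, plain-Python sketch of such a bytecode dispatch loop. The opcodes and the `Frame` class are made up for this example; the real PySOM and TruffleSOM loops have far more opcodes, each with its operation inlined into the loop, plus additional hints for the metacompilation systems, which is exactly why the compiled loop body grows so large.

```python
# Hypothetical 'opcode, argument' encoding; every instruction takes two slots.
PUSH_CONST, LOAD_LOCAL, ADD, RETURN = range(4)

class Frame(object):
    def __init__(self, num_locals, literals):
        self.locals = [None] * num_locals
        self.literals = literals
        self.stack = []  # reified operand stack, allocated for every call

def interpret(bytecodes, frame):
    pc = 0
    while True:
        op = bytecodes[pc]
        arg = bytecodes[pc + 1]
        pc += 2
        # One branch per opcode: with a realistic opcode set and each
        # operation inlined here, this single loop can compile to tens or
        # hundreds of kilobytes of native code on the metacompilation systems.
        if op == PUSH_CONST:
            frame.stack.append(frame.literals[arg])
        elif op == LOAD_LOCAL:
            frame.stack.append(frame.locals[arg])
        elif op == ADD:
            right = frame.stack.pop()
            left = frame.stack.pop()
            frame.stack.append(left + right)
        elif op == RETURN:
            return frame.stack.pop()
        else:
            raise ValueError("unknown bytecode %d" % op)
```

For example, `interpret([PUSH_CONST, 0, PUSH_CONST, 1, ADD, 0, RETURN, 0], Frame(0, [3, 4]))` returns 7. Note also the explicitly reified operand stack in the frame, one of the run-time data structures that cost a bytecode interpreter memory, as discussed next.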

We also looked at memory use, and on that metric bytecode interpreters indeed win over AST interpreters. However, the difference is not as stark as one might expect, and depending on how they are structured, bytecode interpreters may also require more allocations, for instance for boxing or for run-time data structures such as a reified stack.

Though, this blog post is just a high-level overview. For the details, please see the paper below.

Recommendations: AST or Bytecode?

Based on our results, what would I do if I had to implement another language on top of Truffle or RPython? Well, as always, it really depends… Let’s look at the two ends of the spectrum I can think of:

An Established, Widely Used Language: For this, I’d assume that a bytecode has been defined and that it is going to evolve slowly in the future. I’d also assume that there are possibly large existing programs with a lot of code, i.e., hundreds of thousands of lines of code. For these types of languages, I’d suggest sticking with the bytecode. Bytecode is fast to load, and lots of code is likely executed only once or not at all. This means the memory benefits of the compact representation likely outweigh other aspects, and we get decent performance.

A Completely New Language: Here I will assume the language will first need to find its way, and its design may change frequently. On top of metacompilation systems, we can get really good performance, there’s not a lot of code for our language yet, and our users are unlikely to care too much about loading huge amounts of code. Here, AST interpreters in the Truffle style are likely the more flexible choice. You don’t need to design a bytecode and can instead focus on the language and on getting to acceptable performance first. Later, once you have larger code bases with their own performance challenges, one may still think about designing a bytecode, but I suspect few languages will ever get there.

For languages in between these two ends of the spectrum, one would probably want to weigh the engineering effort, which I’d expect to be lower for AST interpreters, against the memory use for large code bases, where bytecode interpreters are better.

AST vs. Bytecode: Interpreters in the Age of Meta-Compilation

Our paper includes more on the background, our interpreter implementations, and our experimental design. It also has a more in-depth analysis of the performance properties we observed in terms of run time and memory use. To guide the work of language implementers, we also looked at the various optimizations to see which ones are most important for interpreter as well as just-in-time-compiled performance.

As an artifact, our paper comes with a Docker image that includes all experiments and raw data, which hopefully enables others to reproduce our results and expand on or compare to our work. The Dockerfile itself is also on GitHub, where one can also find the latest versions of TruffleSOM and PySOM. Though these two don’t yet contain the same versions as used for the paper, and may evolve further.

The paper is the result of a collaboration with Octave, Humphrey, and Sophie.

Any comments or suggestions are also greatly appreciated, perhaps on Twitter @smarr or Mastodon.

Abstract

Thanks to partial evaluation and meta-tracing, it became practical to build language implementations that reach state-of-the-art peak performance by implementing only an interpreter. Systems such as RPython and GraalVM provide components such as a garbage collector and just-in-time compiler in a language-agnostic manner, greatly reducing implementation effort. However, meta-compilation-based language implementations still need to improve further to reach the low memory use and fast warmup behavior that custom-built systems provide. A key element in this endeavor is interpreter performance. Folklore tells us that bytecode interpreters are superior to abstract-syntax-tree (AST) interpreters both in terms of memory use and run-time performance.

This work assesses the trade-offs between AST and bytecode interpreters to verify common assumptions and whether they hold in the context of meta-compilation systems. We implemented four interpreters, each an AST and a bytecode one using RPython and GraalVM. We keep the difference between the interpreters as small as feasible to be able to evaluate interpreter performance, peak performance, warmup, memory use, and the impact of individual optimizations.

Our results show that both systems indeed reach performance close to Node.js/V8. Looking at interpreter-only performance, our AST interpreters are on par with, or even slightly faster than their bytecode counterparts. After just-in-time compilation, the results are roughly on par. This means bytecode interpreters do not have their widely assumed performance advantage. However, we can confirm that bytecodes are more compact in memory than ASTs, which becomes relevant for larger applications. However, for smaller applications, we noticed that bytecode interpreters allocate more memory because boxing avoidance is not as applicable, and because the bytecode interpreter structure requires memory, e.g., for a reified stack.

Our results show AST interpreters to be competitive on top of meta-compilation systems. Together with possible engineering benefits, they should thus not be discounted so easily in favor of bytecode interpreters.

  • AST vs. Bytecode: Interpreters in the Age of Meta-Compilation
    O. Larose, S. Kaleba, H. Burchell, S. Marr; Proceedings of the ACM on Programming Languages, OOPSLA'23, p. 318–346, ACM, 2023.
  • Paper: HTML, PDF
  • DOI: 10.1145/3622808
  • Appendix: online appendix
  • BibTeX: bibtex
    @article{Larose:2023:AstVsBc,
      abstract = {Thanks to partial evaluation and meta-tracing, it became practical to build language implementations that reach state-of-the-art peak performance by implementing only an interpreter. Systems such as RPython and GraalVM provide components such as a garbage collector and just-in-time compiler in a language-agnostic manner, greatly reducing implementation effort. However, meta-compilation-based language implementations still need to improve further to reach the low memory use and fast warmup behavior that custom-built systems provide. A key element in this endeavor is interpreter performance. Folklore tells us that bytecode interpreters are superior to abstract-syntax-tree (AST) interpreters both in terms of memory use and run-time performance.
      
      This work assesses the trade-offs between AST and bytecode interpreters to verify common assumptions and whether they hold in the context of meta-compilation systems. We implemented four interpreters, each an AST and a bytecode one using RPython and GraalVM. We keep the difference between the interpreters as small as feasible to be able to evaluate interpreter performance, peak performance, warmup, memory use, and the impact of individual optimizations.
      
      Our results show that both systems indeed reach performance close to Node.js/V8. Looking at interpreter-only performance, our AST interpreters are on par with, or even slightly faster than their bytecode counterparts. After just-in-time compilation, the results are roughly on par. This means bytecode interpreters do not have their widely assumed performance advantage. However, we can confirm that bytecodes are more compact in memory than ASTs, which becomes relevant for larger applications. However, for smaller applications, we noticed that bytecode interpreters allocate more memory because boxing avoidance is not as applicable, and because the bytecode interpreter structure requires memory, e.g., for a reified stack.
      
      Our results show AST interpreters to be competitive on top of meta-compilation systems. Together with possible engineering benefits, they should thus not be discounted so easily in favor of bytecode interpreters.},
      appendix = {https://doi.org/10.5281/zenodo.8147414},
      articleno = {233},
      author = {Larose, Octave and Kaleba, Sophie and Burchell, Humphrey and Marr, Stefan},
      blog = {https://stefan-marr.de/2023/10/ast-vs-bytecode-interpreters/},
      doi = {10.1145/3622808},
      html = {https://stefan-marr.de/papers/oopsla-larose-et-al-ast-vs-bytecode-interpreters-in-the-age-of-meta-compilation/},
      issn = {2475-1421},
      journal = {Proceedings of the ACM on Programming Languages},
      keywords = {AST Bytecode CaseStudy Comparison Interpreter JITCompilation MeMyPublication MetaTracing PartialEvaluation myown},
      month = oct,
      number = {OOPSLA2},
      numpages = {29},
      pages = {318--346},
      pdf = {https://stefan-marr.de/downloads/oopsla23-larose-et-al-ast-vs-bytecode-interpreters-in-the-age-of-meta-compilation.pdf},
      publisher = {{ACM}},
      series = {OOPSLA'23},
      title = {AST vs. Bytecode: Interpreters in the Age of Meta-Compilation},
      volume = {7},
      year = {2023},
      month_numeric = {10}
    }