Interpreter Generators: A Brief Look at Existing Work

Motivated by Tiger, a tool for generating interpreters, being mentioned on Twitter, I had a brief look at vmgen, Tiger, eJSTK, Truffle DSL, and DynSem. What follows are my rather rough notes and pointers. So, this is by no means a careful literature study, and I welcome further pointers.

vmgen: A DSL for Interpreters

vmgen [1, 2] might be one of the seminal projects on interpreter research. It has been used to evaluate optimizations such as superinstructions and stack caching.

It provides a domain-specific language (DSL) to specify the semantics of bytecodes as operations on a stack. These specifications are then used to generate the interpreter, possibly as threaded code interpreter, and allow vmgen to apply optimizations to the bytecodes or construct superinstructions.

The instruction definition for an integer addition looks something like this (see [1]):

iadd (i1 i2 -- i)
i = i1 + i2;

The first line encodes the stack operations, and says that two integers are popped of the stack, and one is return as result. The variable names can then be used in the second line to define the actual operation using C code. The language also supports defining register machines and superinstructions.

Overall the DSL is a relatively thin layer over the C target code, which gives a lot of control while enabling a high degree of automation.

Tiger: vmgen 2.0

Tiger is the direct successor of vmgen, and was used for instance to evaluate the benefits of static and dynamic interpreter code replication to help branch predictors.

From a language perspective, it looks like an extension of vmgen:

ADD SP( int a, int b - int c )
    IP( - next);

The main difference in the above example is that stack (SP) and instruction stream (IP) behavior are made more explicit. In addition to vmgen’s support for superinstructions, Tiger supports instruction specialization:


This would create a specialized instruction that pushes the 0 integer value. This can be used together with the superinstruction support, for instance to create an instruction to increment a value:


To utilize the processors pipelining and out-of-order execution, Tiger allows to load the address for the next instruction early. It also supports to defer read/write operations for instance if they are only needed in one of the branches. This gives Tiger a few more low-level features to have more control over what the generated interpreter is doing and how it performs on a CPU.

The most complete overview is likely to be found in K. Casey’s dissertation [3, 4], which also has a good overview of interpreter optimizations.

Truffle DSL: A DSL for Self-Optimizing Interpreters

Skipping ahead a decade, Truffle DSL [5, 6] seems the most related bit of work I am aware of. It’s quite different from vmgen and Tiger in that it’s a DSL for self-optimizing interpreters and focuses specifically on the self-optimizing aspect. It is part of the GraalVM project and used in combination with a system that provides just-in-time compilation based on partial evaluation for the languages built using the Truffle framework. Thus, one of the key differences is that it is not targeted at solving the low-level performance issues of bytecode-like interpreters. Instead, it provides mechanisms to conveniently express a language’s semantics and allow the definition of the various special cases dynamic languages have for even basic operations such as addition.

The classic example is the addition operation:

abstract class AddNode extends Node {
  int addIntInt(int left, int right) {
    return left + right;

  String addStringInt(String left, int right) {
    return left + right;

The above class represents an addition node. The Truffle DSL generates a concrete Java class to implement the abstract one, providing the glue code to manage the defined @Specialization methods. It checks that the arguments for the addition, and depending on the seen types, will activate the appropriate one. In this case both arguments are either expected to be integers, or the left one a string. When the left one is indeed a string, then the addStringInt() method is executed. The interpreter keeps track of which specializations were used for a specific lexical location to guide the just-in-time compilation and enables it to reach peak performance competitive with state-of-the-art JIT compiling VMs.

The DSL supports also expression guards, to not just rely on types but for instance check values to distinguish different special cases. It also supports the notion of assumptions, a mechanism that acts globally and allows to invalidate code that is not needed anymore, perhaps because it is inconsistent with some global property, perhaps that no subclass of a specific class is loaded.

To minimize interpreter overhead, it also supports boxing elimination for specializations by generating type-specialized code paths. This allows for instance the addIntInt() method to directly get unboxed integers and returns an unboxed integer, which avoids unboxing and reboxing, a common problem for interpreters with uniform data representations.

In addition, the Truffle framework comes with various other elements, for instance frame objects that type specialize, an object storage model for the same purpose, and so-called Truffle libraries, which is a mechanism to generate glue code for custom data structures with specializing representations.

eJSTK: A DSL for Type Specializations

eJSTK [7], the embedded JavaScript Tool Kit, is an approach to generate custom JavaScript virtual machines (incl. the interpreters), specialized to a specific program. Dynamic languages like JavaScript have complex semantics, and even seemingly simple operations such as addition require a lot of code to be implemented for the general case. eJSTK’s goal is to identify which semantics a program actually needs to reduce the interpreter to the minimum necessary implementation and fit it even on extremely resource constraint systems, by minimizing the binary size of the interpreter.

The main idea to achieve this is to specialize the interpreter based on the types and operations needed for the execution of a program. To enable this specialization, it defines the bytecodes in terms of specializations similar to the Truffle DSL. Though, instead of generating Java code from it, eJSTK generates a C interpreter.

\inst add (Register dst, Value v1, Value v2)
\when v1:fixnum && v2:fixnum \{
  int s = int_from_fixnum(v1) + int_from_fixnum(v2);
  dst = int_to_number(s);
\when v1:string && (v2:fixnum || v2:flonum || v2:special) \{
  v2 = to_string(context, v2);
  goto add_STRSTR;
\when v1:string && v2:string \{
  dst = concat_to_string(context, cstr_from_string(v1), cstr_from_string(v2));
// ... omitted various cases
\otherwise \{
  double x1 = convert_to_double(context, v1);
  double x2 = convert_to_double(context, v2);
  dst = double_to_number(x1 + x2);

The above example is adapted from [7] and illustrates the type-based specializations. The interpreter generator uses it together with information of which types a JavaScript program will use at run time to only generate the minimum necessary set of specializations, construct decision trees for type tests, and possibly also generate superinstructions [8].

Compared to the Truffle DSL, it’s more low-level, in spirit closer to vmgen and Tiger, but also has fewer features around interpreter optimizations.

DynSem: Specifying Dynamic Semantics of Languages

Somewhat related, but with a much broader scope is DynSem [1, 2]. Part of the Spoofax language workbench, it is a language to define the whole semantics of a language in terms of a term-rewriting system. Thus, in a type of formal semantics. Not sure what the current state of the project is, but in the past, the specification itself was executed by a DynSem interpreter. Though, the authors talked also about compiling the specification down to a Truffle interpreter specific to the language specified in DynSem.

E ⊦ e1 :: H1 ⟶ IntV(i) :: H2;
E ⊦ e2 :: H2 ⟶ IntV(j) :: H3;
IntV(addI(i, j))⇒ v
E ⊦ Plus(e1, e2) :: H1 ⟶ v :: H3

The above defines the addition semantics by roughly saying if there’s a Plus(e1, e2) in a program, it will evaluate to a value v. Though, it requires that e1 and e2 both evaluate to an integer value IntV, which then can be added using the addI() function.

Arguable, DynSem is on another abstraction level to the other things mentioned here. Though, I suppose in the future, it would be nice to connect these things and get really fast interpreters from a definition possibly as high-level as DynSem.


vmgen and Tiger are fairly low-level and focus on classic interpreter optimizations. Though, for dynamic languages such as JavaScript, Ruby, and Python, one benefits likely from specifying the different cases for an operation based on for instance the types of arguments as we see with Truffle DSL and eJSTK. There’s a huge gap to DynSem, and I am not sure it will be bridged any time soon. Though, before than, it might be beneficial to bring some of the aspects of vmgen, Tiger, and eJSTK into a setting like Truffle DSL to benefit from optimizations such as superinstructions or even supernodes.

If you have any comments, suggestions, or pointers, would be great to have a discussion on Twitter.

Open Postdoc Position on Language Implementation and Concurrency

King Arthur watching over project Camelot.

We have an open Postdoc position here in the Programming Languages and Systems group at Kent.

It’s a 2 year position in our CaMELot research project, where we want to “Catch and Mitigate Event-Loop Concurrency Issues”.

Even so event loops prevent various low-level concurrency issues, we are still left with many high-level ones [1]. But, there are various mechanisms that allow us to prevent concurrency bugs from causing harm [2, 3, 4] even though they may have slipped into our software.

The problem with current approaches is that they typically have high run-time overhead. The techniques for detecting concurrency issues such as race conditions and message interleaving issues can slow down systems too much, making them impractical.

And we want to change that by creatively borrowing language implementation ideas [5, 6].

If this sounds exciting to you, please reach out, and apply for our position. We’re a pretty friendly and supportive bunch here at Kent and hope this position can be your stepping stone whether you want to stay in academia or perhaps go into industrial research.

If you have any questions, wonder whether you qualify, or are simply curious, please send me an email or a message @smarr on Twitter.

Towards a Synthetic Benchmark to Assess VM Startup, Warmup, and Cold-Code Performance

One of the hard problems in language implementation research is benchmarking. Some people argue, we should benchmark only applications that actually matter to people. Though, this has various issues. Often, such applications are embedded in larger systems, and it’s hard to isolate the relevant parts. In many cases, these applications can also not be made available to other researchers. And, of course, things change over time, which means maintaining projects like DaCapo, Renaissance, or Jet Stream is a huge effort.

Which brought me to the perhaps futile question of how we could have more realistic synthetic benchmarks. ACDC and ACDC-JS are synthetic garbage collection benchmarks. While they don’t seem to be widely used, they seemed to have been a useful tool for specific tasks. Based on observing metrics for a range of relevant programs, these synthetic benchmarks were constructed to be configurable and allow us to measure a range of realistic behaviors.

I am currently interested in the startup, warmup, and cold-code performance of virtual machines, and want to study their performance issues. To me it seems that I need to look at large programs to get interesting and relevant results. With large, I mean millions of lines of code, because that’s where our systems currently struggle. So, how could we go about to create a synthetic benchmark for huge code bases?

Generating Random Code that Looks Real

In my last two blog posts [1, 2], I looked at the shape of large code bases in Pharo and Ruby to obtain data for a new kind of synthetic benchmark. I want to try to generate random code that looks “real”. And by looking real I mean for the moment that it is shaped in a way that is similar to real code. This means, methods have realistic length, number of local variables, and arguments. For classes, they should have a realistic number of methods and instance variables. In the last two blog posts, I looked at the corresponding static code metrics to get an idea of how large code bases actually look like.

In this post, I am setting out to use the data to get a random number generator that can be used to generate “realistic looking” code bases. Of course, this doesn’t mean that the code does do anything realistic.

Small steps… One at a time… 👨🏼‍🔬

So, let’s get started by looking at how the length of methods looks like in large code bases.

Before we get started, just one more simplification: I will only consider methods that have 1 to 30 lines (of code). Setting an upper bound will make some of the steps here simpler, and plots more legible.

And a perhaps little silly, but nonetheless an issue, it will avoid me having to change one of the language implementations I am interested in, which is unfortunately limited to 128 bytecodes and 128 literals (constants, method names, etc.), which in practice translates to something like 30 lines of code. While this could be fixed, let’s assume 30 lines of code per method ought to be enough for anybody…

Length of Methods

When it comes to measuring the length of methods, there are plenty of possibly ways to go about. Pharo counts the non-empty lines. And for Ruby, I counted either all lines, or the lines that are not just empty and not just comments.

The histogram below shows the results for methods with 1-30 lines.

Distribution of the length of methods.

Despite difference in languages and metrics, we see a pretty similar shape. Perhaps with the exception of the methods with 1-3 lines.

Generating Realistic Method Length from Uniform Random Numbers

Before actually generating method length randomly, let’s define the goal a bit more clearly.

In the end, I do want to be able to generate a code base where the length of methods has a distribution very similar to what we see for Ruby and Pharo.

Though, the random number generators we have in most systems generate numbers in a uniform distribution typically in the range from 0 to 1. This means, each number between 0 and 1 is going to be equally likely to be picked. To get other kinds of distributions, for instance the normal distribution, we can use what is called the inverted cumulative distribution function. When we throw our uniformly distributed numbers into this function, we should end up with random numbers that are distributed according to the distribution that we want.

One of the options to do this would be:

  1. determine the cumulative distribution of the method length
  2. approximate a function to represent the cumulative distribution
  3. and invert the function

I found this post here helpful. Though, I struggled defining a good enough function to get results I liked.

So, instead, let’s do it the pedestrian way:

  1. calculate the cumulative sum for the method length (cumulative distribution)
  2. normalize it to the sum of all lengths
  3. use the result to look up the desired method length for a uniform random number

Determining the Cumulative Distribution

Ok, so, the first step is to determine the cumulative distribution. Since we have the three different cases for Pharo, Ruby with all lines and lines of code, this is slightly more interesting.

Percentage of methods with a specific length in lines or LOC.

The plot above shows the percentage of methods that have a length smaller or equal to a specific size.

So, the next question is, which metrics should I choose? Since the data is a bit noisy, especially for small methods, let’s try and see what the different types of means give us.

Different means applied to the cumulative percentage of methods for a given length.

From the above plot, the geometric mean seems a good option. Mostly because I don’t want to have a too high and too low number of methods with a single line.

Using the geometric mean, gives us the following partial cumulative distribution table:

length cum.perc
1 0.0520631
2 0.2647386
3 0.5137505
4 0.6068893
5 0.6851248
6 0.7377313
7 0.7861208
8 0.8200427
9 0.8490800

In R, the language I use for these blog posts, I can then use something like the following to take a uniform random number from the range of 0-1 to determine the desired method length in the range of 1-30 lines (u being here the random number):

loc_from_u <- function (u) {
  Position(function (e) { u < e }, cumulative_distribution_tbl)

There are probably more efficient ways of going about it. I suppose a binary search would be a good option, too.

The general idea is that with our random number u, we find the last position in our array with the cumulative distribution, where u is smaller than the value in the array at that position. The position then corresponds to the desired length of a method.

Three examples of methods generated, for 100, 1,000, or 100,000 methods.

As a test, the three plots above are generated from 100, 1,000, and 100,000 uniformly distributed random numbers, and it looks pretty good. Comparing to the very first set of plots in this post, this seems like a workable and relatively straightforward approach.

To use these results and generate methods of realistic sizes in other languages, the full cumulative distribution is as follows: [0.0520631241473676, 0.264738601144803, 0.51375051909561, 0.606889305644881, 0.685124787391578, 0.737731315373305, 0.786120782303596, 0.820042695503066, 0.849080035429476, 0.872949669419948, 0.893437469528804, 0.909716501217452, 0.923766913966731, 0.935357118689879, 0.945074445934092, 0.953001059092301, 0.959743413722937, 0.965457396618992, 0.97072530951053, 0.975142363341172, 0.979105371695575, 0.982654867280203, 0.985723232507825, 0.988399223471247, 0.990960559703172, 0.993172997617124, 0.9951015059492, 0.996855138214434, 0.998541672752458, 1].

Method Arguments and Local Variables

With the basics down, we can look at the number of arguments and local variables of methods. One thing I haven’t really thought about in the previous posts is that there’s a connection between the various metrics. They are not independent of each other.

Perhaps this is most intuitive for the number of local variables a method has. We wouldn’t expect a method with a single line of code to have many local variables, while longer methods may tent to have more local variables, too.

Number of Method Arguments

Let’s start out by looking at how method length and number of arguments relate to each other.

I’ll use the cumulative distribution for these plots, since that’s what I am looking for in the end.

Percentage of methods with a specific number of arguments.

The two plots above show for each method length from 1-30 a line (so, this is where limiting the method length becomes actually handy). Though, because there are many, I highlight only every third length, including length 1 methods. The bluest blue is length 1, and the red is length 30.

We can see here differences between the languages. For instance, for methods with only 1 line in Pharo, only ≈45% of them have no argument. While for Ruby methods, that’s perhaps around 70%.

The other interesting bit that is clearly visible is that the number of arguments doesn’t have a simple direct relationship to length. Indeed, longer methods seem to have more likely fewer arguments. While medium length methods are more likely to have a few more arguments, at least for the Ruby data this seems to be the case.

So, from these plots, I conclude that I actually need a different cumulative distribution table for each method length. Since we saw how they look for method length, I won’t include the details. Though, of course happy to share the data if anyone wants it.

Number of Methods Locals

Next up, let’s look at the number of locals.

Percentage of methods with a specific number of local variables.

For the Pharo data, it’s not super readable, but basically 100% of methods of length 1 have zero local variables. Compared to the plot on arguments, we also see a pretty direct relationship to length, because the blue-to-red gradient comes out nicely in the plot.

In the case of Ruby, this seems to be similar, but perhaps not as cleanly as for Pharo. The different y-axis start points are also interesting, because they indicate that longer methods in Pharo are more likely to have arguments than in Ruby.

For generating code, I suppose one needs to select the distributions that are most relevant for one’s goal.

Classes: Number of Methods and Fields

After looking at properties for methods, let’s look at classes. I fear, these various metrics are pretty tangled up, and one could probably find many more interesting relationships between them, but I’ll restrict myself for this post to the most basic ones. First I’ll look at the cumulative distribution for the number of methods per class, and then look at the number of instance variables classes have depending on their size.

Number of Methods per Class

I’ll restrict my analysis here to classes with a maximum of 100 methods, because the data I have does not include enough classes with more than 100 methods.

Percentage of classes with a specific number of methods.

As we saw in the previous post, we can here see that Ruby has many more classes with only one or two methods. On the other hand, it seems to have slightly fewer larger classes. Expressed differently, about 60% of all Ruby methods (which includes closures) have 1 or 2 arguments, while in the case of Pharo (where closures where ignored), we need about 5-6 arguments to reach the same level.

Number of Fields for Classes with a Specific Number of Methods

For the number of fields of a class, I can easily look at the relation to the number of methods, too.

Percentage of classes with a specific number of fields.

In the two plots above, we see that there is an almost clear relationship between the number of methods and fields. A class having more methods seems to indicate that is may have more fields. For both languages, there’s some middle ground where things are not as clear, but at least for classes with fewer methods, it seems to hold well.


The most important insight for me from this exercise is that I can generate code that has a realistic shape relatively straightforwardly based on the data collected.

At least, it seems easy and reliable to get a random number distribution of the desired shape.

In addition, we saw that there are indeed interdependencies between the different metrics. This is not too surprising, but something one needs to keep in mind when generating “realistic” code.

So, where from here? Well, I already got a code generator that can generate millions of lines of code that use basic arithmetic operations. The next step would be to fit this code into a shape that’s more realistic. One problem I had before is that my generated code started stressing things like the method look up, in a way real code doesn’t. Shaping things more realistically, will help avoid optimizing things that may not matter. Then again, we see pathologic cases also in real code.

Ending this post, there are of course more open questions, for instance:

  • how do I generate realistic behavior?
  • do large chunks of generated code allow me to study warmup and cold-code performance in a meaningful way? Or asked differently, does the generated code behave similar enough to real code?
  • which other metrics are relevant to generate realistic code?

Though, I fear, I’ll need to wait until spring or summer to revisit those questions.

For suggestions, comments, or questions, find me on Twitter @smarr.

Older Posts

Subscribe via RSS