Interpreters, Compilation, and Concurrency Tooling in PLAS at Kent

Here at Kent, we have a large group of researchers working on Programming Languages and Systems (PLAS), and within this group, we have a small team focusing on research on interpreters, compilation, and tooling to make programming easier.

It’s summer 2021, and I felt it’s time for a small inventory of the things we are up to. At this very moment, the team consists of Sophie, Octave, and myself. I’ll include Dominik and Carmen, as well, for bragging rights. Though, they are either finishing the PhD or just recently defended it.

If you find this work interesting, perhaps you’d like to join us for a postdoc?

Utilizing Program Phases for Better Performance

Sophie recently presented her early work at the ICOOOLPS workshop. Modern language virtual machines do a lot of work behind the scenes to understand what our programs are doing.

However, techniques such as inlining and lookup caches have limitations. While one could see inlining as a way to give extra top-down context to a compilation, it’s inherently limited because of the overhead of run-time compilation and excessive code generation.

To bring top-down context more explicitly into these systems, Sophie explores the notion of execution phases. Programs often do different things one after another, perhaps first loading data, then processing it, and finally generating output. Our goal is to utilize these phases, to help compilers produce better code for each of them. To give just one example, here a bit from one of Sophie’s ICOOOLPS slides:

Screenshot of RStudio

The green line there is the microbenchmark with phase-based splitting enabled, giving a nice speedup for the second and fourth phase, benefitting from monomorphization and only compiling whats important for the phase, and not for the whole program.

Sophie has already shown that there is potential, and the discussions at ICOOOLPS lead to a number of new ideas for experiments, but it’s still early days.

Generating Benchmarks to Avoid the Tiny-Benchmark Trap

In some earlier blog posts and a talk at MoreVMs’21 (recording), I argued that we need better benchmarks for our language implementations. It’s a well known issue that the benchmarks that academia uses for research are rarely a good representation of what real systems do. Often, simply because they are tiny. Thus, I want to be able to monitor a system in production and then generate benchmarks that can be freely shared with other researchers from the behavior that we saw. Octave is currently working on such a system to generate benchmarks from abstract structural and behavioral data about an application.

It’s a long way, and Octave currently instruments Java applications, record what they do at run time, and generate benchmark with similar behavior from that. Given that Java isn’t the smallest language, there’s a lot to be done, but I hope we’ll have a first idea of whether this could work by the end of the summer 🤞🏻

Reproducing Nondeterminism in Multiparadigm Concurrent Programs

Dominik is currently writing up his dissertation on reproducing nondeterminism. His work is essentially all around tracing and record & replay of concurrent systems. We wrote a number of papers [1, 2, 3] which developed efficient ways of doing this first just for actor programs, and then for various other high-level concurrency models. The end result allows us to record & replay programs that combine various concurrency models, with very low overhead. This is the kind of technology that is needed to reliably debug concurrency issues, and perhaps in the future even allow for automatic mitigation!

Advanced Debugging Techniques to Handle Concurrency Bugs in Actor-based Applications

Last month, Carmen successfully defended her PhD on debugging techniques for actor-based applications. This work focused on the user-side of debugging. First, we built a debugger for all kind of concurrency models, looked at what kind of bugs we should worry about, and finally she did a user study on a new debugger she built to address these issues. We also worked on enabling the exploration of all possible executions and concurrency bugs of a program. The Voyager Multiverse Debugger has a nice demo created by our collaborators, which shows that we can navigate all possible executions paths of non-deterministic programs:

Preventing Concurrency Issues, Automatically, At Run Time

While all these projects are very dear to my heart, there’s one, I’d really love to make more progress on as well: automatically preventing concurrency issues from causing harm.

We are looking for someone to join our team!

If you are interested in programming language implementation and concurrency, please reach out! We have a two-year postdoc position here at Kent in the PLAS group, and you would join Sophie, Octave, and me to work on interesting research. In the project, we’ll continue to collaborate also with Prof. Gonzalez Boix in Brussels (Belgium), Prof. Mössenböck in Linz (Austria), and the GraalVM team of Oracle Labs, which includes the opportunity for research visits.

Our team is well connect for instance also with Shopify, which supports a project on improving warmup and interpreter performance of GraalVM languages.

Again, please reach out via email or perhaps via Twitter.

Interpreter Generators: A Brief Look at Existing Work

Motivated by Tiger, a tool for generating interpreters, being mentioned on Twitter, I had a brief look at vmgen, Tiger, eJSTK, Truffle DSL, and DynSem. What follows are my rather rough notes and pointers. So, this is by no means a careful literature study, and I welcome further pointers.

vmgen: A DSL for Interpreters

vmgen [1, 2] might be one of the seminal projects on interpreter research. It has been used to evaluate optimizations such as superinstructions and stack caching.

It provides a domain-specific language (DSL) to specify the semantics of bytecodes as operations on a stack. These specifications are then used to generate the interpreter, possibly as threaded code interpreter, and allow vmgen to apply optimizations to the bytecodes or construct superinstructions.

The instruction definition for an integer addition looks something like this (see [1]):

iadd (i1 i2 -- i)
i = i1 + i2;

The first line encodes the stack operations, and says that two integers are popped of the stack, and one is return as result. The variable names can then be used in the second line to define the actual operation using C code. The language also supports defining register machines and superinstructions.

Overall the DSL is a relatively thin layer over the C target code, which gives a lot of control while enabling a high degree of automation.

Tiger: vmgen 2.0

Tiger is the direct successor of vmgen, and was used for instance to evaluate the benefits of static and dynamic interpreter code replication to help branch predictors.

From a language perspective, it looks like an extension of vmgen:

ADD SP( int a, int b - int c )
    IP( - next);
    c=a+b;

The main difference in the above example is that stack (SP) and instruction stream (IP) behavior are made more explicit. In addition to vmgen’s support for superinstructions, Tiger supports instruction specialization:

+SPECIAL PUSHINT 0;

This would create a specialized instruction that pushes the 0 integer value. This can be used together with the superinstruction support, for instance to create an instruction to increment a value:

PUSHINT1_ADD = PUSHINT 1 ADD;

To utilize the processors pipelining and out-of-order execution, Tiger allows to load the address for the next instruction early. It also supports to defer read/write operations for instance if they are only needed in one of the branches. This gives Tiger a few more low-level features to have more control over what the generated interpreter is doing and how it performs on a CPU.

The most complete overview is likely to be found in K. Casey’s dissertation [3, 4], which also has a good overview of interpreter optimizations.

Truffle DSL: A DSL for Self-Optimizing Interpreters

Skipping ahead a decade, Truffle DSL [5, 6] seems the most related bit of work I am aware of. It’s quite different from vmgen and Tiger in that it’s a DSL for self-optimizing interpreters and focuses specifically on the self-optimizing aspect. It is part of the GraalVM project and used in combination with a system that provides just-in-time compilation based on partial evaluation for the languages built using the Truffle framework. Thus, one of the key differences is that it is not targeted at solving the low-level performance issues of bytecode-like interpreters. Instead, it provides mechanisms to conveniently express a language’s semantics and allow the definition of the various special cases dynamic languages have for even basic operations such as addition.

The classic example is the addition operation:

@NodeChild("left")
@NodeChild("right")
abstract class AddNode extends Node {
  @Specialization
  int addIntInt(int left, int right) {
    return left + right;
  }

  @Specialization
  String addStringInt(String left, int right) {
    return left + right;
  }
}

The above class represents an addition node. The Truffle DSL generates a concrete Java class to implement the abstract one, providing the glue code to manage the defined @Specialization methods. It checks that the arguments for the addition, and depending on the seen types, will activate the appropriate one. In this case both arguments are either expected to be integers, or the left one a string. When the left one is indeed a string, then the addStringInt() method is executed. The interpreter keeps track of which specializations were used for a specific lexical location to guide the just-in-time compilation and enables it to reach peak performance competitive with state-of-the-art JIT compiling VMs.

The DSL supports also expression guards, to not just rely on types but for instance check values to distinguish different special cases. It also supports the notion of assumptions, a mechanism that acts globally and allows to invalidate code that is not needed anymore, perhaps because it is inconsistent with some global property, perhaps that no subclass of a specific class is loaded.

To minimize interpreter overhead, it also supports boxing elimination for specializations by generating type-specialized code paths. This allows for instance the addIntInt() method to directly get unboxed integers and returns an unboxed integer, which avoids unboxing and reboxing, a common problem for interpreters with uniform data representations.

In addition, the Truffle framework comes with various other elements, for instance frame objects that type specialize, an object storage model for the same purpose, and so-called Truffle libraries, which is a mechanism to generate glue code for custom data structures with specializing representations.

eJSTK: A DSL for Type Specializations

eJSTK [7], the embedded JavaScript Tool Kit, is an approach to generate custom JavaScript virtual machines (incl. the interpreters), specialized to a specific program. Dynamic languages like JavaScript have complex semantics, and even seemingly simple operations such as addition require a lot of code to be implemented for the general case. eJSTK’s goal is to identify which semantics a program actually needs to reduce the interpreter to the minimum necessary implementation and fit it even on extremely resource constraint systems, by minimizing the binary size of the interpreter.

The main idea to achieve this is to specialize the interpreter based on the types and operations needed for the execution of a program. To enable this specialization, it defines the bytecodes in terms of specializations similar to the Truffle DSL. Though, instead of generating Java code from it, eJSTK generates a C interpreter.

\inst add (Register dst, Value v1, Value v2)
\when v1:fixnum && v2:fixnum \{
  int s = int_from_fixnum(v1) + int_from_fixnum(v2);
  dst = int_to_number(s);
\}
\when v1:string && (v2:fixnum || v2:flonum || v2:special) \{
  v2 = to_string(context, v2);
  goto add_STRSTR;
\}
\when v1:string && v2:string \{
  add_STRSTR:
  dst = concat_to_string(context, cstr_from_string(v1), cstr_from_string(v2));
\}
// ... omitted various cases
\otherwise \{
  double x1 = convert_to_double(context, v1);
  double x2 = convert_to_double(context, v2);
  dst = double_to_number(x1 + x2);
\}

The above example is adapted from [7] and illustrates the type-based specializations. The interpreter generator uses it together with information of which types a JavaScript program will use at run time to only generate the minimum necessary set of specializations, construct decision trees for type tests, and possibly also generate superinstructions [8].

Compared to the Truffle DSL, it’s more low-level, in spirit closer to vmgen and Tiger, but also has fewer features around interpreter optimizations.

DynSem: Specifying Dynamic Semantics of Languages

Somewhat related, but with a much broader scope is DynSem [1, 2]. Part of the Spoofax language workbench, it is a language to define the whole semantics of a language in terms of a term-rewriting system. Thus, in a type of formal semantics. Not sure what the current state of the project is, but in the past, the specification itself was executed by a DynSem interpreter. Though, the authors talked also about compiling the specification down to a Truffle interpreter specific to the language specified in DynSem.

E ⊦ e1 :: H1 ⟶ IntV(i) :: H2;
E ⊦ e2 :: H2 ⟶ IntV(j) :: H3;
IntV(addI(i, j))⇒ v
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
E ⊦ Plus(e1, e2) :: H1 ⟶ v :: H3

The above defines the addition semantics by roughly saying if there’s a Plus(e1, e2) in a program, it will evaluate to a value v. Though, it requires that e1 and e2 both evaluate to an integer value IntV, which then can be added using the addI() function.

Arguable, DynSem is on another abstraction level to the other things mentioned here. Though, I suppose in the future, it would be nice to connect these things and get really fast interpreters from a definition possibly as high-level as DynSem.

Conclusion

vmgen and Tiger are fairly low-level and focus on classic interpreter optimizations. Though, for dynamic languages such as JavaScript, Ruby, and Python, one benefits likely from specifying the different cases for an operation based on for instance the types of arguments as we see with Truffle DSL and eJSTK. There’s a huge gap to DynSem, and I am not sure it will be bridged any time soon. Though, before than, it might be beneficial to bring some of the aspects of vmgen, Tiger, and eJSTK into a setting like Truffle DSL to benefit from optimizations such as superinstructions or even supernodes.

If you have any comments, suggestions, or pointers, would be great to have a discussion on Twitter.

Open Postdoc Position on Language Implementation and Concurrency

King Arthur watching over project Camelot.

We have an open Postdoc position here in the Programming Languages and Systems group at Kent.

It’s a 2 year position in our CaMELot research project, where we want to “Catch and Mitigate Event-Loop Concurrency Issues”.

Even so event loops prevent various low-level concurrency issues, we are still left with many high-level ones [1]. But, there are various mechanisms that allow us to prevent concurrency bugs from causing harm [2, 3, 4] even though they may have slipped into our software.

The problem with current approaches is that they typically have high run-time overhead. The techniques for detecting concurrency issues such as race conditions and message interleaving issues can slow down systems too much, making them impractical.

And we want to change that by creatively borrowing language implementation ideas [5, 6].

If this sounds exciting to you, please reach out, and apply for our position. We’re a pretty friendly and supportive bunch here at Kent and hope this position can be your stepping stone whether you want to stay in academia or perhaps go into industrial research.

If you have any questions, wonder whether you qualify, or are simply curious, please send me an email or a message @smarr on Twitter.

Older Posts

Subscribe via RSS