Over the course of the next four weeks, I plan to publish a new post every Tuesday to give a detailed introduction on how to use the Graal compiler and the Truffle framework to build fast languages. And this is the very first post to setup this series. The next posts are going to provide a bit of background on Golo, the language we are experimenting with, then build up the basic interpreter for executing a simple Fibonacci and later a Mandelbrot computation. To round off the series, we will also discuss how to use one of the tools that come with Graal to optimize the performance of an interpreter. But for today, let’s start with the basics.

What Is Graal JIT Compilation and How Can It Be Useful?

Graal is a just-in-time compiler for the Java Virtual Machine (JVM), written in Java. In combination with the Truffle framework, one can implement simple abstract-syntax-tree (AST) interpreters that can automatically use the Graal compiler. This means, one only has to implement the language once as an interpreter, and essentially gets a JIT compiler for free that can reach the performance of state-of-the-art virtual machines such as HotSpot for Java or V8 for JavaScript. Thus, Graal is useful to reach good performance without building a custom virtual machine.

An alternative approach would be RPython with its meta-tracing just-in-time compiler. A more thorough comparison of the two approaches can be found here and here. However, for this series of blog posts, we are interested in a language running on top of the JVM, and thus, Truffle is our framework of choice.

Step 1: Setting A Goal and Choose A Benchmark

Truffle provides a nice set of features to simplify the implementation of languages. However, we want to focus here on using Graal and Truffle to improve the performance of a language implementation. So, let’s look into the following example from this perspective.

1. The Numerical Performance of Golo

Our language of choice is going to be Golo. Golo has been designed as a dynamic language for the JVM that benefits from the JVM’s invokedynamic mechanism to provide an expressive yet simple language experience. If you haven’t heard of it before, don’t worry. We will stick to the basics and focus on features that are common to many languages. From some basic benchmarks, we know that Golo is still quite a bit slower on numeric code than Java.

2. Mandelbrot as a Benchmark

To keep it simple, but still choose something that can be used to investigate performance, we use a Mandelbrot calculation as our main example. The following code snippet is an abbreviated version of the Java benchmark we will use:

while (y < size) {
  double ci = (2.0 * y / size) - 1.0;
  int x = 0;

  while (x < size) {
    // ...
    double cr = (2.0 * x / size) - 1.5;

    int z = 0;
    int escape = 1;

    while (z < 50) {
      zr = zrzr - zizi + cr;
      zi = 2.0 * zr * zi + ci;

      zrzr = zr*zr;
      zizi = zi*zi;

      if (zrzr + zizi > 4.0) {
        escape = 0;
        break;
      }
      z += 1;
    }

    byte_acc = (byte_acc << 1) | escape;
    bit_num += 1;
// remainder left out for brevity

In this fragment, we got three nested loops that work mostly on double values. For a dynamic language, this means that it is essential for reaching good performance to optimize the lookup of operators and methods as well as the way double values are treated.

In dynamic languages, the concrete operation behind the multiplication ‘*’ the division ‘/’ or a less-than ‘<’ comparison operator is typically known only at runtime. For optimal performance a language implementation needs to ensure that that these operators are not looked up for every single operation, even if the general program semantics do not give any guarantees about the values they might operate on. Furthermore, sophisticated optimizations need to be able to move the processor instructions related to such lookups out of loops. Otherwise, the simple comparison in a loop condition could incur a huge overhead.

Similarly, double values are often treated as objects. This usually means that they have a boxed representation. Thus, an object is allocated to hold the underlying primitive value of the double. This however means, that an object needs to be allocated. Since each operation potentially produces a new double object, this could mean a lot of work for the garbage collector. Another drawback is that the actual double value has to be unboxed before it can be processed by processor instructions. This means that there is at least a dereference operation for the value before the actual computation happens.

In future parts of this series of blog posts, we will see how these issues can be approached with Graal and Truffle.

3. Initial Performance Comparison

To get an initial idea of the performance difference between Golo and Java on our Mandelbrot benchmark, we are going to measure the peak performance. For that, we execute the benchmark on each VM 100 times and measure the execution time for the last 10 iterations.

Comparing Performance of Golo and Java on Mandelbrot Benchmark

As result, we see that Mandelbrot takes about 108ms to execute on Java and about 512ms on Golo, which means, Golo is about 4.7x slower for our Mandelbrot benchmark, leaving plenty of room to increase Golo’s performance.

4. Next Time: A First Look at Golo

Before we start adding Graal support, the next part of this blog post series looks briefly into the current Golo implementation. It compiles Golo directly to JVM bytecodes and relies heavily on invokedynamic to realize the dynamic language semantics. Interestingly, Golo does not have bit operations. We’ll use that as an excuse to get a basic idea of how the implementation works and then can compare it later to the Graal+Truffle version.

Acknowledgements

I’d like to thank Julien Ponge for his comments on drafts of this tutorial and his help with the technical details of Golo. And I’d also like to thank Benoit and Thomas F. from the SSW for their comments.

Creative Commons License
“Add Graal JIT Compilation to Your JVM Language in 5 Easy Steps” by Stefan Marr is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Permissions beyond the scope of this license are available on request.