Type Hierarchies and Guards in Truffle Languages

Continuing my notes on Truffle and Graal, this one is based on my observations in SOMns and changes to its message dispatch mechanism. Specifically, I refactored the main message dispatch chain in SOMns. As in Self and Newspeak, all interactions with objects are message sends. Thus, field access and method invocation are essentially the same, which makes message sending key to good performance.

In my previous design, I structured the dispatch chain in a way that, I thought, would reduce the necessary runtime checks. This design decision came from TruffleSOM, where the class hierarchy is much simpler, and where it still seems to work well.

My naive design distinguished two different cases. One case is that the receiver is a standard Java object, for instance a boxed primitive such as a long or double, or another Java object that is used directly. The second case is objects from my own hierarchy of Smalltalk objects under SAbstractObject.

The hierarchy is a little more involved: it includes the abstract base class; SObjectWithClass, a class for objects that have a Smalltalk class; a class for objects without fields; and a class for objects with fields, which is in turn subclassed by classes for mutable and immutable objects. There are still a few more details to it, but I think you get the idea.
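In code, a rough sketch of this hierarchy could look as follows. Note that this is a simplification for illustration; only SAbstractObject and SObjectWithClass are named above, the other class names are assumptions:

abstract class SAbstractObject { }

// objects that have a Smalltalk class
abstract class SObjectWithClass extends SAbstractObject { }

// objects without fields (name assumed)
class SObjectWithoutFields extends SObjectWithClass { }

// objects with fields, subclassed for the mutable and immutable variants
abstract class SObject extends SObjectWithClass { }
final class SMutableObject extends SObject { }
final class SImmutableObject extends SObject { }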

So, with that, I thought, let's structure the dispatch chain like this, starting with a message send node as its root:

MsgSend
  -> JavaRcvr
  -> JavaRcvr
  -> CheckIsSOMObject
        \-> UninitializedJavaRcvr
  -> SOMRcvr
  -> SOMRcvr
  -> UninitializedSOMRcvr

This represents a dispatch chain for a message send site that has seen four different receivers: two primitive types and two Smalltalk types. This could be the case, for instance, for the polymorphic + message.

The main idea was to split the chain into two parts so that the check for a SOM object is done at most once, and the second part of the chain can then simply cast the receiver to SObjectWithClass to read the Smalltalk class from it.

Now it turns out, this is not the best idea. The main problem is that SObjectWithClass is not a leaf class in my SOMns hierarchy (it is in TruffleSOM, where the design originates). This means that, at runtime, the guard for SObjectWithClass can be expensive. When I looked at the compilation in IGV, I saw many instanceof checks that could not be removed and resulted in a runtime traversal of the class hierarchy to confirm that a specific concrete class was indeed a subclass of SObjectWithClass.

To avoid these expensive checks, I refactored the dispatch nodes to extract the guard into its own node that does only the minimal amount of work for each specific case, and that only ever checks for the specific leaf class of my hierarchy that is expected for a given receiver.
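To give a flavor, such a guard could be as simple as the following sketch (the names here are made up for illustration and do not match the actual SOMns code):

// Hypothetical guard: check only for the expected leaf class.
// Since SMutableObject is a leaf (final) class, this compiles down to
// a cheap exact type check instead of a hierarchy traversal.
protected static boolean isMutableObject(Object rcvr) {
  return rcvr instanceof SMutableObject;
}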

This also means the new dispatch chain is no longer separated into two parts as it was before. Instead, the nodes are simply added in the order in which the different receiver types are observed over time.

Overall, the performance impact is rather large. I saw a gain of 10% on the Richards benchmark and about 20% on DeltaBlue. Unfortunately, my refactoring also changed a few other details besides the instanceof checks and casts. It also made the guards for objects with fields depend on the object layout instead of the class, which avoids having multiple guards for essentially the same constraint further down the road.

So, the main take-away here is that the choice of guard types can have a major performance impact. I also had a couple of other @Specialization methods that were using non-leaf classes, for instance like this:

@Specialization
public Object doSOMObject(SObjectWithClass rcvr) {...}

This looks inconspicuous at first, but fixing those and a few other things resulted in an overall runtime reduction of between 20% and 30% on multiple benchmarks.
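The fix is to specialize on the concrete leaf classes instead, roughly like the following sketch (assuming SMutableObject and SImmutableObject are the relevant leaf classes):

@Specialization
public Object doMutableObject(SMutableObject rcvr) {...}

@Specialization
public Object doImmutableObject(SImmutableObject rcvr) {...}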

A good way to find these issues is to check in IGV whether instanceof or checked-cast snippets are inlined but not completely removed. Often, they are already visible in the list of phases when the snippets are resolved. Another way to identify them is the Graal option -Dgraal.option.TraceTrufflePerformanceWarnings=true (I guess that would be -G:+TraceTrufflePerformanceWarnings when mx is used). The output names the specific non-leaf type checks that have been found in the graph. Not all of them are critical, because some can be removed by later phases. To check that, you can take the id of the node from the output and search for it in the corresponding IGV graph, for instance by entering id=3235 in the search field.

Add Graal JIT Compilation to Your JVM Language in 5 Easy Steps, Step 5

Step 5: Optimizing the Interpreter for Compilation

In the previous post of this series, we completed the support for executing Mandelbrot and saw that, with the help of the Graal compiler, our interpreter reaches the same performance as Golo's bytecode-based implementation.

In this post, we first introduce how the Graal compiler works for our interpreter, and then we are going to use IGV, a viewer for Graal's compilation graphs, to identify a performance bottleneck and optimize it.

Introduction to Graal's Compilation Approach

Conceptually, Graal is a pretty classic compiler. It takes an input program, applies all kinds of optimizations structured into different compiler phases, and based on that generates native code for execution.

The way we use Graal is, however, a little unusual: we use it as a meta-compiler. Instead of applying it to a specific input program, we apply it to an interpreter that executes a specific program. Thus, Graal is not specific to a language, but can compile any language that is implemented as a Truffle interpreter.

The compilation works very similarly to many other just-in-time compilers, i.e., compilers that run at runtime based on a program's behavior. Graal considers the compilation of the methods of our Golo program once a certain execution threshold is reached. This usually means that when a Golo method has been executed 1000 times, Graal will start compiling and optimizing it.

For our benchmark, it eventually compiles the mandelbrot(size) method that we saw in the previous posts. This method is represented as an AST, and Graal knows that the relevant entry point is the execute(.) method of the RootNode. So, it takes the AST and the code of the execute(.) method and starts compiling. In a first step, it partially evaluates the code of this method. In that process, it will reach, for instance, field reads from AST nodes or method calls on these nodes. In these cases, it can fill in the blanks based on the concrete AST it got. And since our AST has already specialized itself to contain the most specific operations possible, this process works very well. In the end, this partial evaluation constructs the compilation unit. This means it gathers, some say inlines, all code that is reachable from the initial execute(.) method. This is a greedy process, so it inlines as much as possible. (In some cases this is problematic; for more control, the @TruffleBoundary annotation can be used to stop the inlining.) The goal is to know as much about the execution of this method as possible. When Graal finds calls to other Golo functions, it uses a heuristic to decide whether to inline those, too. This means we end up with a large set of code that describes which operations our program performs, and to which we can apply classic compiler optimizations. Generally, this process is pretty successful in removing all the overhead of the interpreter and yields native code that runs fast.
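As an aside, the use of the @TruffleBoundary annotation mentioned above could look like the following hypothetical helper (this is not code from the Golo interpreter); String.format is complex library code that we would not want to pull into the compilation unit:

import com.oracle.truffle.api.CompilerDirectives.TruffleBoundary;

// Hypothetical helper: stop partial evaluation from inlining the
// complex string formatting machinery into the compilation unit.
@TruffleBoundary
private static String formatErrorMessage(Object rcvr) {
  return String.format("unexpected receiver: %s", rcvr);
}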

However, as we have seen in the previous post, we still had an overhead of about 5x over Java, which is not really impressive. So, let's see why this is the case.

Setting up Graal and Golo

So far, we have not discussed the more practical issues of how to obtain the Graal compiler, its tools, and Golo. Let's rectify that now. To investigate our performance issues, we need IGV, also known as the Ideal Graph Visualizer, which is currently maintained as part of the Graal compiler. And we need, of course, the Golo implementation. In case you are more interested in the later discussions than in following along yourself, feel free to skip to the next section.

First, we want the GraalVM binaries. The GraalVM is a release of the Graal compiler together with Oracle's language implementations for JavaScript, R, and Ruby. It can be downloaded from the Oracle Technology Network.

Unfortunately, IGV is not part of this release, so we also need the source repository. This is a little more involved. We need the mx tool, which is used for obtaining dependencies, building, and executing development builds of Graal. Like many OpenJDK projects, Graal uses Mercurial as its version control system.

The following steps should result in a usable setup:

mkdir graal-ws; cd graal-ws             # create and change to a working folder
hg clone https://bitbucket.org/allr/mx  # get mx
./mx/mx sclone http://hg.openjdk.java.net/graal/graal-compiler  # get graal

Note, this can take quite a while. The repositories are rather large.

In the meantime, we can already get and compile the Golo branch with support for Mandelbrot. Use git to check out the truffle/mandelbrot branch in the graal-ws folder, and then use the provided Gradle script to compile everything:

git clone -b truffle/mandelbrot https://github.com/smarr/golo-lang.git
cd golo-lang
./gradlew instDist  # warning: this will download all dependencies

After obtaining the above-mentioned GraalVM binary, the source repository, and Golo, we are all set for investigating the performance of our Golo interpreter.

How to use IGV to Inspect Compilation Results

The next step is to run our benchmark. Assuming that the GraalVM release was put into the graal-ws folder, we should be able to execute the following:

../GraalVM-0.9/bin/java -classpath build/install/golo/lib/golo-3.0.0-incubation-SNAPSHOT.jar:build/install/golo/lib/asm-5.0.4.jar:build/install/golo/lib/jcommander-1.48.jar:build/install/golo/lib/txtmark-0.13.jar:build/install/golo/lib/json-simple-1.1.1.jar fr.insalyon.citi.golo.cli.Main golo --truffle --files samples/mandelbrot.golo

This is the full command to run the Truffle interpreter on the GraalVM. Thus, it includes the classpath with all runtime dependencies, as well as the Golo command with the --truffle option to execute the samples/mandelbrot.golo file.

The output of the resulting execution should show that the Mandelbrot benchmark is executed 100 times.

Run IGV, Run Benchmark with IGV Output

At this point, we should be able to execute the Mandelbrot benchmark. Now, we want to understand what it is doing. For that, we use IGV, which we start from a new terminal in the graal-compiler folder using mx:

cd graal-compiler
../mx/mx igv

It might take a bit, but eventually a Java desktop application should start up. Once it is running, we can run the benchmark again, but this time asking it to dump all of the compiler information, which is then displayed by IGV:

../GraalVM-0.9/bin/java -Djvmci.option.Dump=Truffle,TruffleTree \
  -classpath build/install/golo/lib/golo-3.0.0-incubation-SNAPSHOT.jar:build/install/golo/lib/asm-5.0.4.jar:build/install/golo/lib/jcommander-1.48.jar:build/install/golo/lib/txtmark-0.13.jar:build/install/golo/lib/json-simple-1.1.1.jar \
  fr.insalyon.citi.golo.cli.Main golo --truffle --files samples/mandelbrot.golo

After completion, IGV should show roughly the following:

IGV showing the compiled methods in the outline

In the outline pane on the left, we see six folders, which correspond to three compilation requests Graal received during the execution of the benchmark. At the top of the list, we see two compilations named RepeatNode. These correspond to loop bodies in the mandelbrot(.) method, and their folders contain the concrete ASTs that were compiled. Then we see two entries from TruffleCompilerThreads that are also about the RepeatNodes; these folders contain the compiler graphs. We will ignore all four of these folders and focus on the last two.

The last two folders correspond to the whole mandelbrot(.) method. The folder named Function@... contains the AST, and the folder from the TruffleCompilerThread contains the compiler graph.

Inspecting the AST

Let's first have a brief look at the AST that was compiled:

IGV showing the AST of the mandelbrot() method.

The screenshot here shows the satellite view (the magnifying-glass icon in the middle). When we are in the normal view, we can navigate over the method and inspect nodes. The right-hand side of IGV should show a property window with details about the nodes. For instance, the LocalVariable nodes contain the details on the slot object they hold, and thereby the variable names.

In general, each AST node shows its subnodes/child nodes. So, the LocalVariableWriteNodes refer to the value expressions that return the values to be written. At the beginning of the method, these correspond to the literals defined in the program.

When looking for instance at the LessThanNode, we see an IntegersNode_:

LessThanNode in IGV, showing specialization for Integers

This IntegersNode_ is generated by the TruffleDSL for the case that all arguments are int values. Thus, it corresponds to the @Specialization we implemented previously. Note also the UninitializedNode_ below it. In case the arguments change and, for instance, lead to a comparison of floating point numbers, the DSL provides us with the support to handle this kind of polymorphism. Exploring the graph further would reveal that the mandelbrot(.) method is very well behaved and all operations require only a single specialization. So, it is ideal for compilation.

In general, a look at the Truffle AST is useful to get a very high-level overview and to confirm initial assumptions about what gets compiled. As we will see in the next step, this high-level view is unfortunately not preserved when looking at the compiler graphs.

Inspecting the Graal Compilation

When we open the last folder, named TruffleCompilerThread..., we see two folders and one graph in between. The first folder contains the details of how the partial evaluation of the AST is performed.

In the screenshot below, we see the initial method that was sent to compilation:

Graal's compilation Graph at the first step

To make the graph a little more readable, we select the options Coloring, Remove State, and Reduce Edges in the filter pane on the right. Since those graphs are typically very complex, these simplifications help to get an overview of what is where.

In the graph, we now see the Start node and various other ones. The red edges indicate control flow between those nodes, and the blue and black edges correspond to data dependencies. What we can read from this graph is that Truffle actually does argument type profiling for all Truffle methods. This is a very generic and usually beneficial optimization: since we pass only untyped Object arrays, the additional type information helps the optimizer by providing the missing information.

At the bottom of the screenshot, we further see a red box that corresponds to the invoke of the isValid() method. Such method calls will typically be inlined during this partial evaluation process.

Graph after completing the Partial Evaluation Phase

The screenshot above shows the whole graph after the partial evaluation phase. It ended up being quite complex, and the nodes are way too small to be distinguishable, but with a little practice, one can spot the loops.

So, let's have a look at the second folder. It contains the output of the various optimization passes applied by Graal. The list looks roughly like this:

 0: initial state
 1: Canonicalizer
 2: Inlining
 3: DeadCodeElimination
 4: Canonicalizer
...
17: DominatorConditionalElimination
18: IterativeConditionalElimination
...
31: LoopFullUnroll
...
36: After inlining snippet HotSpotMethod<BoxingSnippets.doubleValue(Double)>
37: After lowering -100...|Unbox...
...
120: After inlining snippet HotSpotMethod<BoxingSnippets.intValueOf(int)>
121: After lowering -100...|Box...
...
...
186: After inlining snippet HotSpotMethod<NewObjectSnippets.allocateInstance(...)>
187: After lowering -100...|NewInstance...
...

The names of the phases give hints of what the change should be with respect to the previous graph. It includes further inlining, dead code elimination, loop unrolling, and many other classic compiler optimizations.

One thing that sticks out here, however, is the almost hundred passes related to boxing and unboxing of integer and double values. And later, we see another 60 or so passes that relate to object instantiation. This looks very strange: our benchmark should mostly work on primitive numbers, so we really would not expect any boxing or allocation going on. After all, Mandelbrot is a numerical benchmark. So, let's investigate this more closely.

I chose to open phase 28, the result after running the Canonicalizer. This is very early in the process, and the graph should now be in a normalized form. Browsing over it a little, trying to see where all those boxing operations come from, I chose to investigate some Box operation somewhere in the middle. I picked the one that has the number 11510 in my graph. Double-clicking on it hides the rest of the graph and shows only the neighboring nodes. From here, I started to explore the context of this boxing operation. As a parent node, we see a subtraction operation with the constant 1.5 as one input and a division as the other; the division itself has a multiplication with the constant 2.0 as input. The other input to the division is an unbox operation. This all does not look very promising.

Graph showing the context of a boxing operation

With those fragments of information, I'd say this piece of the graph corresponds to the following line of our mandelbrot(.) function:

var cr = (2.0 * doubleValue(x) / size) - 1.5

With that in mind, we can verify that the unbox operation corresponds to the reading of size, which is the function's argument, and thus a value stored as an object in an array. And indeed, exploring the graph, we see the corresponding LoadIndexed at index 0. Ok, so that's expected, and we can't really do anything about it.

Now, why is the boxing operation there? A Phi node depends on its result, and that itself goes into a value proxy node, which seems to have a dependency on another Phi node, which in turn is also an input to the Phi node that has the boxing operation as a dependency.

What we can't see at this stage in the graph is that the operations on the frame object, i.e., reading and writing of local variables, have been converted into pure data dependencies, so that, ideally, there are no actual store operations and operations can consume the temporary results directly.

At this point, I would venture the guess that what we see here is the assignment to the cr variable. Considering the following implementation for local variable assignment, the result looks plausible:

abstract class LocalVariableWriteNode extends ExpressionNode {
  protected final FrameSlot slot;
  public LocalVariableWriteNode(FrameSlot slot) { this.slot = slot; }

  @Specialization
  public Object writeGeneric(VirtualFrame frame, Object expValue) {
    ensureObjectKind();
    frame.setObject(slot, expValue);
    return expValue;
  }
  // ...
}

Currently, our local variables only consider objects. Unfortunately, Graal cannot automatically remove these boxing operations for us, so we need to optimize this manually.

Type-Specialization of Access to Local Variables

To avoid the boxing, we want to store primitive values directly into the frame objects. We'll do this by providing additional specializations for the LocalVariableWriteNode and LocalVariableReadNode classes.

Let's start with writing to variables. We need to add new specializations to the class. Depending on the type of the value that the expr returns, we can use the frame's type-specific setters:

@NodeChild(value = "expr", type = ExpressionNode.class)
abstract class LocalVariableWriteNode extends ExpressionNode {
  protected final FrameSlot slot;
  public LocalVariableWriteNode(FrameSlot slot) { this.slot = slot; }

  @Specialization(guards = "isIntKind()")
  public int writeInt(VirtualFrame frame, int expValue) {
    frame.setInt(slot, expValue);
    return expValue;
  }

  @Specialization(guards = "isDoubleKind()")
  public double writeDouble(VirtualFrame frame, double expValue) {
    frame.setDouble(slot, expValue);
    return expValue;
  }

  @Specialization(contains = {"writeInt", "writeDouble"})
  public Object writeGeneric(VirtualFrame frame, Object expValue) {
    slot.setKind(FrameSlotKind.Object);
    frame.setObject(slot, expValue);
    return expValue;
  }

  protected final boolean isIntKind() {
    if (slot.getKind() == FrameSlotKind.Int) { return true; }
    if (slot.getKind() == FrameSlotKind.Illegal) {
      slot.setKind(FrameSlotKind.Int);
      return true;
    }
    return false;
  }

  protected final boolean isDoubleKind() {
    if (slot.getKind() == FrameSlotKind.Double) { return true; }
    if (slot.getKind() == FrameSlotKind.Illegal) {
      slot.setKind(FrameSlotKind.Double);
      return true;
    }
    return false;
  }
}

We added two new specializations for writing int and double values to the frame. The specializations are based on the implicit check of the result type of the expValue evaluation, which is implied by the method signatures, as well as on the defined guards. Since other AST nodes can cause the type of a variable to change, we need to guard each specialization based on the type recorded in the frame slot. Generally, the idea is that we can use a specialization if the slot has either the right type or is not yet initialized. The slot is uninitialized if its kind is Illegal.

Since we now have three different specializations, we need to think about their relation at runtime. By designing the guards so that they fail if an unexpected type is found, we make sure that we end up in the most general case, writeGeneric. This avoids changing back and forth between specializations, which would prevent compilation. Furthermore, we tell the writeGeneric specialization that it contains the two more specific ones. This allows the DSL to generate code that really uses only the writeGeneric version and removes the previously used specializations to avoid runtime overhead.

The read node is a little simpler:

abstract class LocalVariableReadNode extends ExpressionNode {
  protected final FrameSlot slot;

  public LocalVariableReadNode(FrameSlot slot) { this.slot = slot; }

  @Specialization(guards = "isUninitialized()")
  public Object doNull() { return null; }

  @Specialization(guards = "isInitialized()",
                  rewriteOn = {FrameSlotTypeException.class})
  public int doInt(VirtualFrame frame) throws FrameSlotTypeException {
    return frame.getInt(slot);
  }

  @Specialization(guards = "isInitialized()",
                  rewriteOn = {FrameSlotTypeException.class})
  public double doDouble(VirtualFrame frame) throws FrameSlotTypeException {
    return frame.getDouble(slot);
  }

  @Specialization(guards = "isInitialized()")
  public Object doObject(final VirtualFrame frame) {
    return frame.getValue(slot);
  }

  protected boolean isInitialized() {
    return slot.getKind() != FrameSlotKind.Illegal;
  }

  protected boolean isUninitialized() {
    return slot.getKind() == FrameSlotKind.Illegal;
  }
}

The first specialization uses the slot kind to specialize for the case that a read is done on a variable that has not yet been assigned a value; thus, it returns null. The remaining three specializations are for the int, double, and finally the generic Object case. In the interpreter this means we first try to read from a slot as int. If that fails, we catch the FrameSlotTypeException and go to the next specialization, in this case doDouble. If that one fails as well, we go to the last specialization, which will always succeed by reading a boxed object.

Measuring the Results

With this optimization in place, let's see what it yields. We use the same execution setup as earlier and get the following results:

Final performance of Golo+Graal JIT compilation

This time around, Golo+Graal is about as fast as Java: it is roughly 18% slower than Java, but about 4x faster than Golo with its bytecode-based backend on top of the HotSpot JVM. With a bit more optimizing, we could probably squeeze a little more out of Golo+Graal, but let's call it fast enough for now.

Summary

In this post, we discussed how Graal compiles our interpreter, how to use IGV to get a better understanding of what is going on, and how we can optimize access to local variables to avoid excessive boxing overhead. With the optimization, we got within 20% of Java's speed and are 4x faster than Golo was before. Overall, a pretty nice result.

However, we had to build a new interpreter from scratch, which took quite a bit of time and code. And the implementation is nowhere near completion. To execute Golo programs, we would still need to implement a large number of things: for instance, how to support its dynamic objects based on the Truffle object model, how to use the @Cached support of the DSL for inline caches, perhaps how to optimize tail calls to encourage a more functional programming style, etc. Plenty of things left to discuss. But for now, this post completes this series. I hope you found it useful.

If you start with your own project, drop by the Graal mailing list or check the archives. We will try to answer your questions. Truffle looks somewhat simple in the beginning, but in my experience it takes quite a bit of time to get used to thinking about how the interpreter and compilation interact to achieve great results.

Truffle Concepts

To close off this post, a brief list of the introduced concepts:

  • @TruffleBoundary asks Graal to stop inlining here during the partial evaluation pass. This is necessary for instance for recursive code, when the termination condition is not based on a compile-time constant.
  • Guards encode conditions for when a specialization is applicable.
  • contains denotes specializations that are subsumed by a more generic specialization.
  • @Cached is an annotation to support custom inline caches in specializations.

Acknowledgements

I'd like to thank Julien Ponge for his comments on drafts of this tutorial and his help with the technical details of Golo. And thanks to Gilles Duboscq for spotting typos.

Creative Commons License
“Add Graal JIT Compilation to Your JVM Language in 5 Easy Steps” by Stefan Marr is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Permissions beyond the scope of this license are available on request.

Add Graal JIT Compilation to Your JVM Language in 5 Easy Steps, Step 4

Step 4: Complete Support for Mandelbrot

In the previous post of this series, we built up all the infrastructure to execute a simple Fibonacci function with a Truffle interpreter for Golo. This included an introduction to the basic aspects of Truffle, its support for specializations, and the idiomatic ways of realizing sequences, control flow, basic operators, and literals. Together with the discussion of how function invocation can be implemented, we covered most of the aspects that are also required for our final goal, i.e., to support the execution of the Mandelbrot program from the very first post of the series.

Below is the main part of the Mandelbrot program implemented in Java:

while (y < size) {
  double ci = (2.0 * y / size) - 1.0;
  int x = 0;

  while (x < size) {
    // ...
    double cr = (2.0 * x / size) - 1.5;

    int z = 0;
    int escape = 1;

    while (z < 50) {
      zr = zrzr - zizi + cr;
      zi = 2.0 * zr * zi + ci;

      zrzr = zr*zr;
      zizi = zi*zi;

      if (zrzr + zizi > 4.0) {
        escape = 0;
        break;
      }
      z += 1;
    }

    byte_acc = (byte_acc << 1) | escape;
    bit_num += 1;
// remainder left out for brevity

Since much of what we need to support Mandelbrot consists of extensions, or is based on the same principles that we discussed before, this post will be much shorter and only sketch the changes to existing elements. However, we discuss in greater detail how to add local variables and loops, and how to call Java functions based on method handles.

Extensions for Longs, Doubles, Strings, and Basic Operations

In a first step, we extend the Types class and the ExpressionNode to support the long and double primitives as well as Strings for specialization. The Types class now looks like this:

@TypeSystem({
  int.class,
  boolean.class,
  long.class,    // added for Mandelbrot
  double.class,  // added for Mandelbrot
  String.class,  // added for Mandelbrot
  Object[].class
})
class Types { }

In the ExpressionNode, we merely add the missing executeLong(.), executeDouble(.), and executeString(.) methods in the same way as the existing ones. This will then allow a node to speculate on its child nodes always evaluating to, for instance, a primitive long value.
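For example, executeLong(.) can be sketched as follows, assuming the DSL-generated TypesGen now also provides an expectLong(.) method for the extended type system:

public long executeLong(VirtualFrame frame) throws UnexpectedResultException {
  return TypesGen.expectLong(executeGeneric(frame));
}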

As we can see in the Mandelbrot program above, we also need support for literals. Specifically, we are going to add nodes for literal long, double, and string values. Furthermore, we add nodes for the true, false, and null literals. With the IntegerLiteralNode from the last post, we already saw the general idea. For true, false, and null, it is even simpler: we merely return the fixed value on execution. So, it looks like this:

class NullLiteralNode extends LiteralNode {

  public Object executeGeneric(VirtualFrame frame) {
    return null;
  }
}

Since Mandelbrot is mostly about computation, we also need a range of additions and extensions to support multiplication, division, addition, subtraction, less-than, equal, and not-equal. Furthermore, we need to implement Truffle versions of the xor, or, and left-shift bit operators.

Below is a sketch of the addition as implemented by the PlusNode. The sketch shows the support for doubles and strings. Note that it also has a doStringAndDouble(.,.) method. This is because the two child nodes can of course evaluate to values of different types, so we need to specify the desired behavior for the relevant combinations as well.

abstract class PlusNode extends BinaryNode {

  @Specialization
  public String doStrings(String left, String right) {
    return left + right;
  }

  @Specialization
  public double doDoubles(double left, double right) {
    return left + right;
  }

  @Specialization
  public String doStringAndDouble(String left, double right) {
    return left + right;
  }
  // ... and a few other combinations, incl. int and long
}

The bit operations are only defined for integers and longs, so the resulting BitXorNode looks like this:

abstract class BitXorNode extends BinaryNode {
  @Specialization
  public long doLongs(long left, long right) {
    return left ^ right;
  }

  @Specialization
  public int doIntegers(int left, int right) {
    return left ^ right;
  }
}

Unary Operations: Type Coercions and the not Operator

Next on our list are the unary operators for value casting. Specifically, we need to implement intValue(.) and doubleValue(.). You might remember from the last post that we handled the println(.) method specially. It is a method defined on Golo's Predefined class that also takes only one parameter. Our solution last time was to create a PrintlnNode during the lookup and directly replace the UninitializedInvocationNode AST node with this node. During execution, this has the major advantage that there is no overhead for function calls or even argument wrapping. To properly support Mandelbrot in Golo, we need explicit coercions for integer and double values. Of course, these are simple operations that should not have any overhead. Thus, we use the same approach.

The nodes are defined as follows:

@NodeChild(value = "expr", type = ExpressionNode.class)
abstract class UnaryNode extends ExpressionNode { }

abstract class IntValue extends UnaryNode implements PreEvaluated {

  abstract Object executeEvaluated(VirtualFrame frame, Object value);

  @Override
  public Object doEvaluated(VirtualFrame frame, Object[] args) {
    return executeEvaluated(frame, args[0]);
  }

  @Specialization
  public int doDouble(double value) {
    return (int) value;
  }
}

At first glance, this seems to be more complicated than necessary. The reason we implement the PreEvaluated interface is the way node specialization works. Normally, a specialize(.) method will replace the old node in the tree and return the new node so that it can be executed directly (see the previous post). However, the new node needs to take the already evaluated arguments. For this purpose, we implement the interface and the corresponding doEvaluated(.) method. The implementation of the abstract executeEvaluated(.) is generated by the TruffleDSL, so we do not have to write it manually. The main functionality of the node is realized by the specialization doDouble(.), which merely casts the double value to an integer value. For the doubleValue(.) operation, the implementation is essentially the same.
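Under these assumptions, a sketch of the corresponding DoubleValue node looks almost identical (the exact set of specializations in Golo may differ):

abstract class DoubleValue extends UnaryNode implements PreEvaluated {

  abstract Object executeEvaluated(VirtualFrame frame, Object value);

  @Override
  public Object doEvaluated(VirtualFrame frame, Object[] args) {
    return executeEvaluated(frame, args[0]);
  }

  @Specialization
  public double doInt(int value) {
    return value;  // implicit widening conversion from int to double
  }
}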

The not operator is handled differently in Golo. Instead of being a function invocation, it is a keyword. We will skip over how to add the necessary support to the Golo-IR-to-Truffle visitor. In the end, the executing AST node is simply this:

abstract class NotNode extends UnaryNode {
  @Specialization
  public boolean doBoolean(boolean value) {
    return !value;
  }
}

Local Variables

One issue we have kept putting off so far is support for local variables. For our Fibonacci function, we merely needed support for accessing arguments (see the previous post, sec. 4.2). This was realized by accessing the array that is passed on function invocation and stored in the frame object given to the execute*(.) methods. Now, we will use these frames to also store local variables.

As a little bit of background, frames are the activation records of functions. Thus, they keep the temporary state that is needed during the execution of a function. This includes access to the actual arguments, local variables, and possibly other execution state.

Truffle distinguishes two types of frames: virtual and materialized frames. The first type is called virtual to indicate that the optimizer will not allocate an object for the frame at runtime; instead, it will use the frame to figure out the data dependencies between operations within a compilation unit, i.e., a function and possibly additional inlined functions. To make this reliable, the use of virtual frames is restricted. For instance, they are not supposed to be assigned to fields of other objects and cannot be passed to methods of objects for which Graal cannot determine the concrete method at compilation time. Generally, virtual frames cannot escape the compilation unit, because that would mean they need to be represented as proper objects. Materialized frames, on the other hand, can be used as normal objects. Graal does not impose the same restrictions on them, but in turn they come with a runtime cost. Materialized frames are useful to implement features like closures.
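For illustration, a closure implementation could capture its defining scope like in the following sketch (Golo's actual closure support is not part of this series):

// Sketch: a materialized frame can be stored in an object field,
// which is not allowed for a VirtualFrame.
final class Closure {
  private final MaterializedFrame lexicalScope;

  Closure(VirtualFrame frame) {
    this.lexicalScope = frame.materialize();
  }
}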

In addition to giving the compiler the means to determine data dependencies, frames are also meant to help it determine concrete type information. For both purposes, frames come with a FrameDescriptor. On the one hand, it maintains the structural information about the slots of a frame; on the other hand, the frame slots can record type information. For now, we will focus on the structural elements and ignore the type information.

What does this mean concretely for our interpreter? For our Function objects it means we need to give them proper FrameDescriptors so that the structure of the frame is known. We do that in our visitFunction(.) method:

public Function visitFunction(GoloFunction function) {
  FrameDescriptor frameDesc = new FrameDescriptor();
  context.frameDescriptors.push(frameDesc);

  ExpressionNode body = function.getBlock().accept(this);

  context.frameDescriptors.pop();
  return new Function(body, function, frameDesc);
}

While it is not really necessary to support Mandelbrot, we keep a stack of frameDescriptors in the visitor, because Golo supports nested functions. With this stack, we can model their lexical scoping correctly.

To access variables, we need to add support for them to the visitReferenceLookup(.) method, next to the existing support for reading arguments:

public ExpressionNode visitReferenceLookup(ReferenceLookup referenceLookup) {
  LocalReference reference = referenceLookup.
          resolveIn(context.referenceTableStack.peek());
  if (reference.isArgument()) {
    return new LocalArgumentReadNode(reference.getIndex());
  } else {
    FrameSlot slot = getFrameSlot(reference);
    return LocalVariableReadNodeGen.create(slot);
  }
}

FrameSlot getFrameSlot(LocalReference reference) {
  return context.frameDescriptors.peek().
          findOrAddFrameSlot(reference.getName());
}

When we see that a reference is not an argument, we create a LocalVariableReadNode. This read node gets a FrameSlot object that is created by the frame descriptor based on the name of the variable. Since we do not care about type information for the moment, a frame slot is merely Truffle's handle to represent the variable.

Assignment statements are transformed similarly:

public ExpressionNode visitAssignmentStatement(
    AssignmentStatement assignment) {
  LocalReference reference = assignment.getLocalReference();
  FrameSlot slot = getFrameSlot(reference);
  return LocalVariableWriteNodeGen.create(
      slot, (ExpressionNode) assignment.
                          getExpressionStatement().accept(this));
}

For them, we create a LocalVariableWriteNode. But of course, a write node still needs the expression that computes the value to be written. So, besides the slot, it also gets the transformed subexpression of the assignment statement.

For the implementation of the read node, we get the following code:

abstract class LocalVariableReadNode extends ExpressionNode {

  protected final FrameSlot slot;

  public LocalVariableReadNode(FrameSlot slot) {
    this.slot = slot;
  }

  @Specialization
  public Object doObject(VirtualFrame frame) {
    return frame.getValue(slot);
  }
}

The node is structured as usual. It takes the slot object as an argument and stores it in a final field so that the compiler knows it can rely on it as a constant. The specialization merely uses the slot object to read a value from the frame. There could be more specializations to read different primitive types from the frame, but for the moment this is not necessary.

The write node looks more or less the same:

@NodeChild(value = "expr", type = ExpressionNode.class)
abstract class LocalVariableWriteNode extends ExpressionNode {

  protected final FrameSlot slot;

  public LocalVariableWriteNode(FrameSlot slot) {
    this.slot = slot;
  }

  @Specialization
  public Object writeGeneric(VirtualFrame frame, Object exprValue) {
    slot.setKind(FrameSlotKind.Object);
    frame.setObject(slot, exprValue);
    return exprValue;
  }
}

Here we need the expression for the value that is to be written, so it is declared with the @NodeChild annotation. In the specialization, we get it as the exprValue argument. Note, when executing writeGeneric(.), we first ensure that the slot kind has been set to Object. The optimizer is able to remove this, because it only reaches frame slots that have been initialized. Finally, we set the exprValue on the frame.

And that's all. With these nodes we support read and write operations for local variables.

Loops

The final element we need for the Mandelbrot function is iteration, i.e., loops. Golo has multiple types of loops, including for and while loops. Fortunately for us, Golo's IR already takes care of desugaring these constructs to a single loop construct with explicit initialization, condition, body, and post expression.

The Main Loop Node

We translate this construct to the following node implementation:

class ForLoopNode extends ExpressionNode {

  @Child protected ExpressionNode init;
  @Child protected LoopNode loopNode;

  public ForLoopNode(ExpressionNode init,
      ExpressionNode condition,
      ExpressionNode body,
      ExpressionNode post) {
    this.init = init;
    loopNode = Truffle.getRuntime().createLoopNode(
        new RepeatNode(condition, body, post));
  }

  @Override
  public Object executeGeneric(VirtualFrame frame) {
    if (init != null) {
      init.executeGeneric(frame);
    }
    try {
      loopNode.executeLoop(frame);
    } catch (BreakLoopException e) { /* just left the loop */ }
    return null;
  }
}

Our ForLoopNode has two @Child nodes. The first one is the init expression. In a for loop like in Java or C, it typically initializes an induction variable, e.g., i, with an initial value. The second child node is Truffle's LoopNode, which realizes the iteration with support for on-stack replacement. More on that in a bit.

The executeGeneric(.) method first executes the init expression, if available, and afterwards the loop. In case the loop contains a break keyword, a control-flow exception is thrown, which is caught here.

Using Truffle's LoopNode

The actual iteration happens in the LoopNode. For that, it requires an implementation of the RepeatingNode interface. The interface has a single method that performs one loop iteration and indicates whether the loop should continue. As mentioned before, this abstraction is meant to enable on-stack replacement. This is a useful feature for long-running methods that contain loops. Normally, execution would only reach a newly compiled native code version of a method when the method is entered again. With on-stack replacement, however, the VM can detect methods with long-running loops and switch to the optimized native code even while executing the loop.

Our implementation of the RepeatingNode interface looks as follows:

class RepeatNode extends Node implements RepeatingNode {

  @Child protected ExpressionNode condition;
  @Child protected ExpressionNode body;
  @Child protected ExpressionNode post;

  public RepeatNode(ExpressionNode condition,
      ExpressionNode body, ExpressionNode post) {
    this.condition = condition;
    this.body = body;
    this.post = post;
  }

  @Override
  public boolean executeRepeating(VirtualFrame frame) {
    if (shouldExecute(frame)) {
      body.executeGeneric(frame);
      if (post != null) {
        post.executeGeneric(frame);
      }
      return true;
    } else {
      return false;
    }
  }

  private boolean shouldExecute(VirtualFrame frame) {
    try {
      return condition.executeBoolean(frame);
    } catch (UnexpectedResultException e) {
      CompilerDirectives.transferToInterpreter();
      throw new UnsupportedSpecializationException(
          this, new Node[]{condition}, e.getResult());
    }
  }
}

It has child nodes for the loop condition, the loop body, and a post expression that is evaluated after each iteration. The method shouldExecute(.) evaluates the loop condition. But since Golo is a dynamically typed language, we cannot be completely sure that the condition expression really returns a boolean. Here, we handle this case simply by throwing an exception that indicates that it is not yet supported.

The executeRepeating(.) method represents one iteration. Thus, it first checks whether the loop should continue executing by invoking shouldExecute(.), and if that is the case, it executes the loop's body. After executing the body, it also executes the post expression, if present. The return value of executeRepeating(.) indicates whether the iteration should continue. Besides these minor complications, implementing a loop construct is thus pretty simple. If on-stack replacement is not desired for some reason, one could also implement the loop directly using any Java loop construct.

The break Keyword in Loops

As we have seen earlier, Golo also supports the break keyword to exit from loops. The implementation idea is very similar to handling the return keyword: the AST node for the break keyword throws a control-flow exception. And as we have seen in the ForLoopNode, we handle it with a try {...} catch (BreakLoopException e) around the execution of the loop. Thus, the break keyword is implemented with the following BreakLoopNode:

class BreakLoopNode extends ExpressionNode {

  @Override
  public Object executeGeneric(VirtualFrame frame) {
    throw new BreakLoopException();
  }
}

The node simply raises the BreakLoopException, which returns execution to the ForLoopNode. After catching the exception, execution simply continues after the loop.
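The exception class itself can be minimal. Presumably, it extends Truffle's ControlFlowException, which tells the optimizer that the exception is used to direct control flow and does not need a stack trace:

import com.oracle.truffle.api.nodes.ControlFlowException;

// Thrown by BreakLoopNode, caught by ForLoopNode.
class BreakLoopException extends ControlFlowException { }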

Invoking Arbitrary Java Methods

Golo is designed to be a dynamic language for the JVM ecosystem. Thus, it embraces Java and its ecosystem, and Golo programs use Java libraries throughout. So far, we added dedicated nodes for each piece of functionality we needed. Applying the same approach to call arbitrary Java methods would not work, so we need a way to call Java methods in a more reflective fashion.

Since Golo's lookup mechanism already returns Java MethodHandles in some cases, we added a MethodHandleInvokeNode:

class MethodHandleInvokeNode extends FunctionInvocationNode {
  private final MethodHandle method;

  public MethodHandleInvokeNode(FunctionInvocationNode uninit,
      MethodHandle method) {
    super(uninit.name, uninit.module, uninit.argumentsNode);
    this.method = method;
  }

  @Override
  public Object executeEvaluated(VirtualFrame frame,
      Object[] args) {
    try {
      return method.invokeWithArguments(args);
    } catch (Throwable e) {
      throw new NotYetImplemented();
    }
  }
}

This node becomes part of the function invocation specialization discussed in the previous post. So, if the lookup returns a MethodHandle, the UninitializedFunctionInvocationNode rewrites itself to this MethodHandleInvokeNode. When executed, it simply calls invokeWithArguments(.).

For our benchmark, this is used to call the methods that read the System's time.

Summary

In this post, we discussed how the interpreter can be extended to support the Mandelbrot program. For this, we extended the TypeSystem, added arithmetic operations, cast operators, support for local variables, support for Golo's loop constructs, as well as the ability to call arbitrary Java methods.

The main goal for this series of posts was to improve the numerical performance of Golo by using the Graal just-in-time compiler. So, let's look at the results. When we execute Mandelbrot now on top of Graal using our Truffle interpreter, we get the following results:

Initial Performance of Golo+Truffle

On Java, Mandelbrot takes about 108ms to execute. With Golo's bytecode compilation backend, it takes about 512ms. And, as we can see in the plot above, Golo running with the Truffle-based interpreter on top of Graal takes about 505ms. This means our Golo+Graal is about the same speed as the bytecode-based version.

In the next post, we'll investigate how the compilation works, and optimize our interpreter. It should not be much slower than Java, which it unfortunately still is. But more on that next time.

Truffle Concepts, Conventions, and Other Important Bits

As at the end of the last post, here is a brief overview of the relevant Truffle concepts:

  • VirtualFrame objects (sec. 3) cannot escape a Graal compilation unit: they cannot be assigned to fields of objects or passed to polymorphic functions.
  • MaterializedFrame objects (sec. 3) can be used as normal objects, but have a runtime overhead.
  • FrameDescriptor (sec. 3) objects define the set of local variables in a method activation in terms of FrameSlots, which can track their runtime types.
  • The LoopNode class (sec. 4.2) enables Graal to perform on-stack replacement to optimize long-running loops during their execution.

Acknowledgements

I'd like to thank Julien Ponge for his comments on drafts of this tutorial and his help with the technical details of Golo.

Creative Commons License
“Add Graal JIT Compilation to Your JVM Language in 5 Easy Steps” by Stefan Marr is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Permissions beyond the scope of this license are available on request.

Add Graal JIT Compilation to Your JVM Language in 5 Easy Steps, Step 3

Step 3: Interpreting a Simple Fibonacci Function with Golo+Truffle

First, a word of warning: this post is much longer than I had hoped. The goal for today is to be able to execute a Fibonacci function with Golo+Truffle. However, this means we need to get acquainted with most of the basic Truffle concepts. And in the Golo implementation, that corresponds to a patch of about 1500 LOC. So please bear with me; this gets a little long-winded.

In case you missed the previous posts, this series starts with a brief intro and motivation. Today's post discusses the basics of self-optimizing interpreters using the Truffle framework. Some of the discussions already include details on how the compilation with Graal works. However, the main focus is on the basic interpreter functionality.

This post discusses how to create a Truffle abstract syntax tree (AST) from the Golo intermediate representation (IR), and explains the basic mechanisms of the Truffle framework, e.g., execute() methods, the TruffleDSL that generates much of the boilerplate code, self-optimization and node specialization, and basic interpreter functionality such as function arguments, function invocation, and control-flow constructs. So, just enough functionality to get a basic Fibonacci function working. At the end of the post, we also provide a brief overview of the most important bits.

Adding Basic Truffle Support

The first step is to configure the Gradle build system to include the Truffle dependencies. As of this writing, the latest release of Truffle and Graal is version 0.9 (binary builds).

We need the basic Truffle libraries as well as the TruffleDSL, which makes it easier to implement the basic operations for our interpreter. The build dependencies are configured in the build.gradle file. We need to add the Truffle jar as a compile dependency. Furthermore, we need a build plugin to enable annotation processing support for the TruffleDSL. The DSL uses annotations to generate implementation classes for us. Since the DSL itself also depends on Truffle, the Truffle jar needs to be registered as an apt dependency as well. Note, at the moment we also need to use a custom Maven repository with the latest Truffle jars:

repositories {
  maven {
    url "http://lafo.ssw.uni-linz.ac.at/nexus/content/repositories/releases/"
  }
  jcenter()
}

plugins {
  // ...
  id "net.ltgt.apt" version "0.3"
}

dependencies {
  // ...
  compile 'com.oracle.truffle:truffle-api:0.9'
  apt     'com.oracle.truffle:truffle-api:0.9'
  apt     'com.oracle.truffle:truffle-dsl-processor:0.9'

  // ...
}

Working in the Debugger

The next step might seem superfluous, but it is going to be extremely useful. The first class we add to the Golo codebase is a NotYetImplemented exception:

class NotYetImplemented extends RuntimeException {}

With this class, we can mark all open issues/todo items and make the compiler happy, which allows us to do all our changes in the debugger. Why in the debugger? Because I don't want to guess what Golo is doing. I don't know Golo. So, I just put a breakpoint where I assume I'll need to change something, and then start hacking there. The debugger can then just tell me what arguments I got, their values, etc. Trivial, but very convenient in an unfamiliar codebase.

Goal: Convert Golo's IR to a Truffle AST to Run Fibonacci

The next step is to add an option to Golo's startup mechanism to convert the Golo IR not into bytecodes, but into a Truffle AST that can be executed. For that, we add a --truffle option to the golo command. If we set it on the command line, Golo will parse the files as normal and then use a visitor to convert the resulting GoloModule into a set of Truffle-based functions/ASTs (cf. change set).

The resulting GoloModules represent Golo code in an IR that is already pretty high-level, and it would require only small changes to be compatible with Truffle. However, for clarity and to avoid artifacts that are not idiomatic in Truffle, I chose to keep the IR and the Truffle AST clearly separated. In a practical system, it might not be necessary.

As a result of separating the Truffle trees from the IR, we only need to add accept(.) methods to the IR. They perform a double dispatch to the corresponding method in the TruffleGenerationGoloIrVisitor. In a first step, the visitor can be left almost empty and just needs visitor methods for all IR elements, which in turn can raise the NotYetImplemented exception. The full visitor implementation was added in my branch with this change.
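Such an accept(.) method is a one-liner. For a block, it might look like the following sketch (class and method names are assumed here):

// In the IR's Block class: double dispatch into the Truffle visitor.
public ExpressionNode accept(TruffleGenerationGoloIrVisitor visitor) {
  return visitor.visitBlock(this);
}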

With this basic infrastructure in place, we can run Golo from the debugger executing the following program: golo --truffle --files samples/fib.golo

module samples.Fibonacci

function fib = |n| {
  if n < 2 {
    return n
  } else {
    return fib(n - 1) + fib(n - 2)
  }
}

function main = |args| {
  println(fib(10))
}

This is a simple recursive version of Fibonacci. To run it with Truffle, we need to implement function calls, function arguments, control flow constructs, and basic arithmetic operations. Sounds simple, but it will keep us busy for the rest of this post.

By now, the debugger should have hit the first NotYetImplemented exception in the visitor method for modules. Ignoring some Golo details, it is sufficient to iterate over all GoloFunctions and visit them one by one. The corresponding visitFunction(.) eventually needs to assemble a Truffle-compatible function. But, following the relevant execution flow in the debugger, we first transform the fib function's block, i.e., its body to a Truffle AST and worry about the details of creating a function object later.

Truffle Basics: Nodes, execute*() Methods, TypeSystems, and Frames

The first element we need is a common root node for all possible AST nodes. In Golo's IR, the corresponding class is essentially GoloStatement. For our purposes, we add ExpressionNode (we use the term expression here because most nodes will need to provide a return value for the execute*() methods), which needs to extend Truffle's Node class. This ExpressionNode needs to provide a set of execute*(frame) methods. The execute*() methods implement the semantics of an AST node. To be able to specialize the behavior of a node on the kind of value it should return, the ExpressionNode provides a set of basic implementations for the types we want to be able to specialize on as return types for expressions. These types are communicated explicitly to the TruffleDSL to generate some type-checking helper methods. We do this with the following Types class:

@TypeSystem({
  int.class,
  boolean.class,
  Object[].class
})
class Types { }

With the @TypeSystem annotation, we list int, boolean, and Object[] as the three types of interest. Note, the object array is going to be used only for arguments to function calls, because integers and booleans are the only data types our Fibonacci program needs.

For the ExpressionNode class, we need the following code:

public abstract class ExpressionNode extends Node {
  public abstract Object executeGeneric(VirtualFrame frame);

  public int executeInteger(VirtualFrame frame) throws UnexpectedResultException {
    return TypesGen.expectInteger(executeGeneric(frame));
  }

  public boolean executeBoolean(VirtualFrame frame) throws UnexpectedResultException {
    return TypesGen.expectBoolean(executeGeneric(frame));
  }

  public Object[] executeObjectArray(VirtualFrame frame) throws UnexpectedResultException {
    return TypesGen.expectObjectArray(executeGeneric(frame));
  }
}

The abstract executeGeneric(frame) method is the generic execute method that each AST node needs to implement. The other three type-specific methods can be implemented if it makes sense for a node. Otherwise, the fallback implementations provided here call executeGeneric(.) and then check that the result is of the expected type. For the check, we use the expect*() methods, which are generated from the Types class. Technically, this is optional, but it avoids some boilerplate code.

Generally, these fallback methods are useful for optimistic, i.e., speculative optimizations, which might not hold for all future executions of a node. We'll go into more details on this later.

One might also notice the frame argument. The frame is the activation record of a Truffle function and is used to access the actual arguments to a function call, local variables, and other state a language implementation might need in the scope of a function. Later we will see the argument access. For the other aspects, please see a later post in this series.

Sequences

With the ExpressionNode in place, we can now transform the body of the GoloFunction, which is called a block and corresponds to a distinct lexical scope in Golo. To represent the content of such blocks, which can be multiple expressions, we use the following SequenceNode that holds the expressions as child nodes:

class SequenceNode extends ExpressionNode {
  @Children private final ExpressionNode[] expressions;

  public SequenceNode(ExpressionNode[] expressions) {
    this.expressions = expressions;
  }

  @ExplodeLoop
  public Object executeGeneric(VirtualFrame frame) {
    for (int i = 0; i < expressions.length - 1; i++) {
      expressions[i].executeGeneric(frame);
    }
    return expressions[expressions.length - 1].executeGeneric(frame);
  }
}

For the Truffle framework, it is important that the expressions field of the class is annotated with @Children so that specializations are handled correctly and so that the Graal compiler knows these expressions can be considered constant. The final keyword is not strictly necessary here, but in general it is good practice to make fields final whenever possible to facilitate constant folding during compilation. (Fields annotated with @Child cannot be final, because they need to be able to specialize themselves; the compiler treats them specially.)

The executeGeneric(.) method iterates over these child expressions and recursively calls executeGeneric(.). The last expression is different, because we need to return its result value.

To enable the Graal compiler later on to optimize this code as much as possible, it is necessary to mark this loop with the @ExplodeLoop annotation to ensure it is unrolled. This makes sure that the call to executeGeneric(.) is not treated as polymorphic. By unrolling the loop, the compiler can inline the specific executeGeneric(.) methods for each expression[i]. This enables many common compiler optimizations. The unrolling is safe here, because the number of statements in a block is lexically defined, and thus constant. Depending on language semantics, this might for instance also apply to the processing of function arguments, as we will see later.
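To illustrate the effect, for a block with three expressions the unrolled loop conceptually turns into the following straight-line code (a sketch of what the compiler sees, not actual compiler output):

// Conceptual unrolling of executeGeneric(.) for expressions.length == 3:
// each call site now refers to one concrete node and can be inlined.
expressions[0].executeGeneric(frame);
expressions[1].executeGeneric(frame);
return expressions[2].executeGeneric(frame);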

Note, while stepping through the debugger, the block we encountered should be the one that contains the if ... {} else {} construct of the fib function, which we dive into now.

Argument Reads

Continuing from the visitBlock(.) function in the debugger, transforming our fib function step-by-step, the next element is going to be the conditional branch. It is transformed to Truffle nodes in the visitConditionalBranching(.) method. As a first step, we visit the condition n < 2. When stepping into it, the left operand is an argument read of n.

As mentioned earlier, function arguments are passed as Object[] arrays in the frame, i.e., activation record of the function. To access them, we add the LocalArgumentReadNode class, which is sketched below. Such a node keeps the index of the argument it corresponds to, and on execution, simply accesses the array from the frame object and returns the value:

class LocalArgumentReadNode extends ExpressionNode {
  protected final int index;

  public LocalArgumentReadNode(int index) {
    this.index = index;
  }

  public Object executeGeneric(VirtualFrame frame) {
    return frame.getArguments()[index];
  }
}

Literals

The right operand is a literal integer 2. Literal nodes are the simplest nodes we have, because they simply return a stored value. For integer values, we add the IntegerLiteralNode below. Note, for the compiler later on, it is important that the val field, in which the value is stored, is final. This ensures that the compiler can treat the value as constant to enable optimizations.

class IntegerLiteralNode extends LiteralNode {
  private final int val;

  public IntegerLiteralNode(int val) {
    this.val = val;
  }

  public int executeInteger(VirtualFrame frame) {
    return val;
  }

  public Object executeGeneric(VirtualFrame frame) {
    return val;
  }
}
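Note, the LiteralNode superclass is not shown in this post. A minimal version, assuming it contributes nothing beyond a common type for all literal nodes, could look like this:

// Minimal sketch: a common superclass for all literal nodes.
abstract class LiteralNode extends ExpressionNode { }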

TruffleDSL, Basic Operators, and Returning from a Function

So far, we needed only very basic mechanisms of Truffle and mostly followed the framework by implementing abstract methods. Now, we are moving on to use the TruffleDSL to provide functionality that is, for instance, specific to certain types that we want to optimize for. The TruffleDSL facilitates this by providing annotations that are used to generate much of the necessary boilerplate code.

The DSL provides the notions of nodes and their children, i.e., subexpressions. Furthermore, it enables us to implement specializations based on types as well as other runtime information. The previously discussed execute*() methods are essentially the interface to subnodes. Using this interface, the DSL generates the self-specializing code to instantiate node objects, provide the specialization checks, evaluate subnodes, do node rewriting, and handle potential polymorphism, i.e., the case that more than one specialization is required for a specific point in the program (AST node).

In the following, we discuss the basic concepts to implement binary operations in the Truffle style.

Node Children and Specializations

Back to the Fibonacci function: the comparison operation for our n < 2 expression is implemented as a LessThanNode. For convenience, it extends the BinaryNode class shown below:

@NodeChildren({
  @NodeChild(value = "left",  type = ExpressionNode.class),
  @NodeChild(value = "right", type = ExpressionNode.class)
})
public abstract class BinaryNode extends ExpressionNode { }

The BinaryNode is an abstract class without any methods. Its main use is to provide the left and right subnodes, i.e., node children, by using the TruffleDSL. As the code above shows, we use the @NodeChildren annotation with two @NodeChild annotations that define left and right to be of type ExpressionNode. The main benefit of using the DSL is that it generates the boilerplate code for specializations and handles many of the common issues of optimistic optimizations. In our specific case here, it enables us to implement the LessThanNode with the following code:

public abstract class LessThanNode extends BinaryNode {

  @Specialization
  public boolean doIntegers(int left, int right) {
    return left < right;
  }
}

Note that LessThanNode is again an abstract class. However, it provides the doIntegers(.,.) method that is annotated with @Specialization. This means, the DSL will generate the code to evaluate the left and right subexpressions of the < operation, which in our case are an argument read and an integer literal. If the evaluation returns the expected integers, it will make sure that the AST is rewritten to use the doIntegers method. Otherwise, it would throw an UnsupportedSpecializationException at runtime. For our Fibonacci support, this behavior is good enough. For a complete Golo implementation, we would of course need to provide the other necessary specializations as well.

At this point, note that both the signature of doIntegers(.,.) and the return type of executeInteger(.) use the primitive int, and thus avoid boxing of integers.
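From this abstract class, the TruffleDSL generates a concrete subclass with a factory method. Assuming the DSL's default NodeGen naming, which we will also see later with PrintlnNodeGen, the node for n < 2 could be created like this:

// Sketch: instantiate the generated node with its two child expressions.
ExpressionNode lessThan = LessThanNodeGen.create(
    new LocalArgumentReadNode(0),   // the argument read for n
    new IntegerLiteralNode(2));     // the literal 2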

Control Flow: The return Keyword

After completing the condition, we are back in the visitConditionalBranching(.) method with the debugger. The next element for which we need to provide an implementation is the then-branch of the condition. Stepping into it, we reach the visitReturnStatement(.) method. In our fib function, we know this is the return n branch. So, there will be an argument read again, which we already covered. Thus, we can focus on the return itself.

Imagine for a moment what happens when we execute the kind of AST that we are just building up. The Java stack trace might look something like this:

ReturnNode.executeGeneric()
Sequence.executeGeneric()
IfNode.executeGeneric()
Sequence.executeGeneric()
Function.execute()

At the bottom of the stack, we see the function call itself, and then we see the different AST nodes with their executeGeneric() methods executing. At the top, we see the node that corresponds to the return operation we want to implement. So, how can we return from this node back to the function and at the same time also pass the desired return value? In Truffle, the solution is rather straightforward and uses ControlFlowExceptions. The class ControlFlowException minimizes the cost of an exception at runtime, for instance, by not capturing a stack trace. To use it, we subclass it with the following ReturnException:

class ReturnException extends ControlFlowException {
  private final Object result;

  public ReturnException(Object result) {
    this.result = result;
  }

  public Object getResult() {
    return result;
  }
}

This ReturnException takes the result we want to return from the function, and when it is thrown, it will unwind the Java stack for us. Thus, we only need to make sure that we catch it later on in the Function class that we still need to implement. More on that later.

In addition to the ReturnException, we also need the corresponding AST node:

class ReturnNode extends ExpressionNode {
  @Child protected ExpressionNode expr;

  public ReturnNode(ExpressionNode expr) {
    this.expr = expr;
  }

  public Object executeGeneric(VirtualFrame frame) {
    throw new ReturnException(expr.executeGeneric(frame));
  }
}

ReturnNode evaluates its child node expr, which is marked with @Child. It uses the resulting value to create a new ReturnException, which is thrown to unwind the stack and return from the function. So, in the case of the return n expression in the fib function, this means we read the argument and wrap the result into an exception object that is then thrown. And that's basically it. To see how the exception is caught, see section 6.5 on implementing the Function class. Note, we did not mark the expr field as final. Since the expression node is supposed to specialize itself during execution, it needs to mutate this field. For the compilation, the @Child annotation is sufficient to consider the field as a constant for optimization.

Addition and Subtraction Operators

Moving on in the debugger should get us back to the visitConditionalBranching(.) method, which we continue to transform by processing the else branch: return fib(n - 1) + fib(n - 2). Handling the return was just covered. So, the next elements of interest are the binary operators + and -. They are implemented essentially in the same way as the less-than operator <. We use the TruffleDSL and its support for specializations:

public abstract class PlusNode extends BinaryNode {
  @Specialization
  public int doIntegers(int left, int right) {
    return left + right;
  }
}

public abstract class MinusNode extends BinaryNode {
  @Specialization
  public int doIntegers(int left, int right) {
    return left - right;
  }
}

Both nodes are implemented as subclasses of BinaryNode and provide the specialized behavior for adding or subtracting integers. With this minimalism, the same caveats apply as for the LessThanNode, i.e., for a full Golo implementation, we would need to provide support for other possible types as well.

Functions and Conditional Control Flow

So far, we merely dabbled with the very basics of language implementation. In this section, we finally make the step to add function calls and conditional control flow. Note, this time, we will not rely on the DSL. Instead, we are going to do some manual node specialization. I hope this provides some additional insights on how these types of self-optimizing interpreters work.

Function Arguments

The generation of the fib(.) function invocations is handled by visitFunctionInvocation(.). As the first step, it needs to transform the argument expressions of the function call. Ignoring Golo's named arguments, we can transform the argument expressions into a plain ExpressionNode[]. To avoid having different function invocation nodes for different numbers of arguments, we introduce an EvalArgumentsNode that will create an array of objects that is passed to the function call. That's also the reason why we needed the Object[] type in our Types class earlier.

class EvalArgumentsNode extends ExpressionNode {
  @Children protected final ExpressionNode[] argumentNodes;

  public EvalArgumentsNode(ExpressionNode[] argumentNodes) {
    this.argumentNodes = argumentNodes;
  }

  @ExplodeLoop
  public Object[] executeObjectArray(VirtualFrame frame) {
    Object[] arguments = new Object[argumentNodes.length];
    for (int i = 0; i < argumentNodes.length; i++) {
      arguments[i] = argumentNodes[i].executeGeneric(frame);
    }
    return arguments;
  }

  public Object executeGeneric(VirtualFrame frame) {
    return executeObjectArray(frame);
  }
}

The implementation of EvalArgumentsNode is very similar to the one of SequenceNode. The main difference is that we create an Object[] array that captures the actual argument values for the call. Note again that the argumentNodes field is annotated with @Children and the executeObjectArray(frame) method is annotated with @ExplodeLoop to help the optimizers.

Function Invocation in Self-Optimizing Interpreters

Once we have the EvalArgumentsNode object, we can create the actual FunctionInvocationNode. This one is quite a bit more complex than any of the nodes we had before. Before we start looking at it, note that we care for the moment only about simple function invocations to targets that can be resolved statically. Thus, we do not support any form of polymorphism yet. This also means that we are going to implement the node manually. With the @Cached mechanism, the TruffleDSL provides a nice way of implementing custom polymorphic inline caches. But here, this approach would be more complicated without providing a benefit.

Before we go into the details, let's back up even more. One of the main assumptions for self-optimizing interpreters in the Truffle style is that self-optimizations are designed so that they always have converging, i.e., stabilizing behavior. To be able to generate native code, we need to avoid getting into an endless loop of back and forth switching between different optimizations. Thus, one needs to design node specializations as a state machine that will reach a final state within a finite and small number of steps.

With this in mind, let's have a look at what we need to do for the function invocation. As already mentioned, we do not support polymorphic functions. And for our Fibonacci example, we have two cases. The first one is the call to the fib(.) function itself, which we just hit in the debugger. The second one is the call to the println(.) function, which we need later to generate output on the terminal. The fib(.) function is going to be resolved statically, and println(.) is a method on Golo's Predefined class, which provides such convenience methods. We will handle println(.) specially, just to show a little more of what we can do with Truffle. (In a full Golo+Truffle backend, one probably wants a more generic mechanism, but we are going to look at that only in the next part of this series of blog posts.)

So, overall, we have three cases to cover: the initial, uninitialized function invocation node, a normal static function call, and well-known builtins that we want to treat specially. This means we can design a node that goes from the uninitialized case to either of the two specializations, depending on the result of a lookup function, as sketched below.
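The resulting specialization state machine can be sketched as follows; both target states are final, so the node rewriting terminates after a single specialization step:

UninitializedInvocationNode
  -> DirectFunctionInvokeNode     (lookup found a Golo function; stable)
  -> intrinsic node, e.g., PrintlnNode   (lookup found a builtin; stable)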

We'll start with the abstract FunctionInvocationNode:

abstract class FunctionInvocationNode extends ExpressionNode implements PreEvaluated {
  protected final String name;
  protected final GoloModule module;

  @Child protected EvalArgumentsNode argumentsNode;

  FunctionInvocationNode(String name, GoloModule module, EvalArgumentsNode argumentsNode) {
    this.name          = name;
    this.module        = module;
    this.argumentsNode = argumentsNode;
  }

  public abstract Object executeEvaluated(VirtualFrame frame, Object[] args);

  public Object doEvaluated(VirtualFrame frame, Object[] args) {
    return executeEvaluated(frame, args);
  }

  public Object executeGeneric(VirtualFrame frame) {
    Object[] args = argumentsNode.executeObjectArray(frame);
    return executeEvaluated(frame, args);
  }
}

It is a subclass of ExpressionNode, and as such it implements executeGeneric(frame). Furthermore, we implement argument evaluation here, since it is always the same, independently of how an invocation is done. Thus, we take the argumentsNode and ask it to evaluate to an Object[], which it will by construction (see the just implemented EvalArgumentsNode).

With the result, we call executeEvaluated(frame, args). This method is a convention that is also used by the DSL. The idea is that a node will first evaluate all child nodes, and then at some point execute the node's own logic. In this case, we only provide an abstract method that is implemented in the concrete subclasses. Note also that we implement the interface PreEvaluated, which gives us a way to execute only the node-specific behavior without evaluating child nodes. This is useful, for instance, when specializing a node: first we need to evaluate the child nodes to be able to determine the applicable specialization, and then we want to execute only the node-specific behavior. Since Golo has side effects, just repeating the evaluation of the child nodes could be incorrect.
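The PreEvaluated interface itself is not shown in this post. A minimal sketch, derived from how it is used here, would be:

// Execute only the node-specific logic, with all subexpressions already
// evaluated; used during specialization to avoid repeating side-effecting
// child evaluations.
interface PreEvaluated {
  Object doEvaluated(VirtualFrame frame, Object[] args);
}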

For the specialization of function invocations, the code looks as follows:

class UninitializedInvocationNode extends FunctionInvocationNode {

  UninitializedInvocationNode(String name, GoloModule module,
      EvalArgumentsNode argumentsNode) {
    super(name, module, argumentsNode);
  }

  public Object executeEvaluated(VirtualFrame frame, Object[] args) {
    return specialize(args).doEvaluated(frame, args);
  }

  private PreEvaluated specialize(Object[] args) {
    Object lookupResult = lookup(args);

    if (lookupResult instanceof Function) {
      return replace(new DirectFunctionInvokeNode(this, (Function) lookupResult));
    } else if (lookupResult instanceof PreEvaluated) {
      return (PreEvaluated) replace((ExpressionNode) lookupResult);
    }
    throw new NotYetImplemented();
  }

  private Object lookup(Object[] args) {
    // ...
  }
}

Our UninitializedInvocationNode class implements the abstract executeEvaluated method. As discussed, it already gets the evaluated arguments, passes them to the specialize(.) function, and then delegates the evaluation to the resulting node by calling doEvaluated(.,.). The specialize(.) function performs the lookup, and then, based on the lookup result, it chooses an appropriate implementation for the invocation. At this point, we only need the two previously discussed ways of invocation. The first one is the DirectFunctionInvokeNode. In case the lookup result was a Golo+Truffle function, we instantiate the invocation node and replace(.) the UninitializedInvocationNode, i.e., the current this, with it.

Just as a side note, the direct use of replace(.) is only advisable for advanced use cases and requires extra care. Normally, the DSL shields us from this operation. replace(.) adapts the AST and makes sure that the parent pointer of the newly inserted node is set correctly. It also performs a few checks to make sure the replacement is legal. If the parent node's field type is not compatible with the new node, we will get an error. While there are such safeguards, we have to manually make sure that a node is used only once in the AST. This is necessary because nodes can have only a single parent, on which the specialization logic relies.

In case the lookup result is some node that implements the PreEvaluated interface, we use it immediately to replace the current node.

Invoking Truffle Functions

The DirectFunctionInvokeNode only has to perform the actual function invocation. As discussed earlier, here we only handle cases that are statically resolved. Thus, there is no further dynamicity that needs to be handled. The only special thing we have to do is to use Truffle's mechanism for function invocation, so that the Truffle framework can apply optimizations such as splitting and inlining. To this end, the following implementation uses a DirectCallNode:

class DirectFunctionInvokeNode extends FunctionInvocationNode {
  @Child protected DirectCallNode call;

  protected DirectFunctionInvokeNode(
      FunctionInvocationNode uninit, Function function) {
    super(uninit.name, uninit.module, uninit.argumentsNode);
    call = Truffle.getRuntime().createDirectCallNode(
                function.getCallTarget());
  }

  public Object executeEvaluated(VirtualFrame frame, Object[] args) {
    return call.call(frame, args);
  }
}

The DirectCallNode caches the so-called CallTarget. CallTargets are Truffle's notion of invokable or callable entities. Thus, they are the executable elements, and they define the calling convention. Without going into the details of the Function implementation yet, our Truffle Functions will cache such a CallTarget that is used in DirectCallNodes. These nodes are created, as shown here, by the Truffle runtime. They represent statically bound calls, and they communicate this knowledge to the compiler. (They can also be used, for instance in combination with the @Cached annotation, to build polymorphic inline caches.) When a call site is highly variable (megamorphic), i.e., corresponds to many different functions, we could use the IndirectCallNode instead. In that case, the optimizer would not perform inlining, because it would most likely slow down execution.
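For completeness, the megamorphic variant could be sketched as follows; this assumes the call(frame, target, args) signature of the Truffle version used in this series:

// Sketch: an indirect call passes the call target on every invocation
// instead of caching it in the node, so the compiler does not inline.
IndirectCallNode call = Truffle.getRuntime().createIndirectCallNode();
Object result = call.call(frame, function.getCallTarget(), args);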

With this support for function invocation, we can transform the remainder of the fib() function. Stepping through the debugger should eventually lead us back to visitReturnStatement() where the ReturnNode for the else branch is created. After this is done, we are finally back at visitBlock(.), where the SequenceNode for the block of the else branch is created.

Control Flow: Conditionals

With the SequenceNode for the else branch of the conditional, we finally have all elements needed to assemble the conditional itself. It is implemented as the following IfNode class:

class IfNode extends ExpressionNode {
  @Child ExpressionNode condition;
  @Child ExpressionNode thenNode;
  @Child ExpressionNode elseNode;

  public IfNode(ExpressionNode condition,
      ExpressionNode thenNode, ExpressionNode elseNode) {
    this.condition = condition;
    this.thenNode  = thenNode;
    this.elseNode  = elseNode;
  }

  @Override
  public Object executeGeneric(VirtualFrame frame) {
    if (doCondition(frame)) {
      thenNode.executeGeneric(frame);
    } else if (elseNode != null) {
      elseNode.executeGeneric(frame);
    }
    return null;
  }

  private boolean doCondition(VirtualFrame frame) {
    try {
      return condition.executeBoolean(frame);
    } catch (UnexpectedResultException e) {
      CompilerDirectives.transferToInterpreter();
      throw new UnsupportedSpecializationException(
          this, new Node[]{condition}, e.getResult());
    }
  }
}

As one might expect, the IfNode has three @Child nodes: the condition, the thenNode, and the elseNode. In executeGeneric(.), we first evaluate the condition, and depending on the result, either the thenNode or elseNode. Note, the elseNode is optional. The check here is also not an issue for performance, because the compiler knows from the @Child annotation that it can consider the value of elseNode as a constant. Thus, the null check will be removed during compilation.

Evaluating the condition is actually a little more complex. (I refrained from putting in a ConditionProfile to avoid additional complexity; usually it is a good idea, e.g., to enable the elimination of unused branches.) The doCondition(.) method needs to handle the case that the condition does not actually return a boolean value. With Golo being a dynamic language, this could of course happen at runtime. For our purposes, we do not provide a fallback implementation, but instead just raise an error. Hence, if executeBoolean(.) fails, we catch the UnexpectedResultException and simply throw an UnsupportedSpecializationException.
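To sketch what adding a ConditionProfile could look like (factory and package names may differ between Truffle versions):

class IfNode extends ExpressionNode {
  // ... fields and constructor as before ...

  // Sketch: the profile records which branches were taken, so the
  // compiler can omit branches that were never executed.
  private final ConditionProfile conditionProfile =
      ConditionProfile.createBinaryProfile();

  @Override
  public Object executeGeneric(VirtualFrame frame) {
    if (conditionProfile.profile(doCondition(frame))) {
      thenNode.executeGeneric(frame);
    } else if (elseNode != null) {
      elseNode.executeGeneric(frame);
    }
    return null;
  }
}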

Note, the first operation in the catch clause is CompilerDirectives.transferToInterpreter(). This operation makes sure that the compiler does not include the catch clause in the compilation. We assume that well-behaved programs won't use non-boolean conditions, and thereby avoid the complexity of handling this special case during compilation. Instead, execution will deoptimize: it leaves the native code Graal produced and returns to this program point in the interpreter, which is slower, but does not prevent many essential optimizations from happening in the compiler.

Truffle Functions aka RootNodes

With the IfNode constructed, we finally have all elements of our fib(.) function transformed to a Truffle representation. Now, we can create the actual Function object:

class Function extends RootNode {
  @Child protected ExpressionNode expr;

  public Function(ExpressionNode expr) {
    this.expr = expr;
    Truffle.getRuntime().createCallTarget(this);
  }

  @Override
  public Object execute(VirtualFrame frame) {
    try {
      return expr.executeGeneric(frame);
    } catch (ReturnException ex) {
      return ex.getResult();
    }
  }
}

For the moment, we need to cover three aspects here: we subclass Truffle's RootNode class, implement its execute(.) method, and handle the ReturnException from earlier. RootNode is the framework's class to represent the main node of a Truffle function. Its execute(.) method takes only a frame object as argument and is the main entry point for evaluating Truffle ASTs. In the case of Golo, a Function simply refers to some ExpressionNode that we ask to execute. As you can see in execute(.), this is also the right place to handle the ReturnException. The Function is the inner part of the call boundary. Hence, we catch the exception, ask it for the result, and then return it. In case there was no exception, we simply return the value to which the expr node evaluated.

Previously, we also discussed the CallTarget as Truffle's mechanism to optimize function calls. Since we need a CallTarget for each of our Truffle functions, we create it in the constructor and it is automatically set on this, so that it can be used later.
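To sketch how the pieces fit together: assuming RootNode's getCallTarget() accessor and the call(Object...) signature of the Truffle version used here, invoking a function from Java boils down to a call with an arguments array:

// Sketch: create the function and invoke it through its call target.
// fibBody stands for the transformed AST of the fib function's body.
Function fib = new Function(fibBody);
CallTarget target = fib.getCallTarget();           // set by createCallTarget(.)
Object result = target.call(new Object[] { 10 });  // corresponds to fib(10)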

An Intrinsic Node for Printing

With that, we are almost set for executing the Fibonacci program. Only the main() function itself remains. When discussing function invocation, I mentioned that we are going to treat println(.) specially to show a little more of how Truffle works.

As a basic implementation of the printing, we use the following node class:

@NodeChild(value = "expr", type = ExpressionNode.class)
abstract class UnaryNode extends ExpressionNode { }

abstract class PrintlnNode extends UnaryNode implements PreEvaluated {
  public abstract Object executeEvaluated(VirtualFrame frame, Object value);

  @Override
  public Object doEvaluated(VirtualFrame frame, Object[] args) {
    return executeEvaluated(frame, args[0]);
  }

  @Specialization
  public Object printObject(Object obj) {
    System.out.println(obj);
    return null;
  }
}

First, we define UnaryNode, very similar to the previously defined BinaryNode. It has one node child expr and otherwise is an empty abstract class. PrintlnNode extends UnaryNode and again implements the PreEvaluated interface, so that we can use it in the specialize(.) method of UninitializedInvocationNode. We also declare the abstract executeEvaluated(.), which the DSL is going to implement for us. The actual implementation is done in the printObject(obj) method, which is annotated with @Specialization. Based on this specialization, the TruffleDSL generates the missing implementation methods, so that we can use the node in the AST.

The question now is how we actually create this node. Below we see a snippet from the corresponding Golo lookup routine:

if (importClass == Predefined.class) {
  switch (lookup) {
    case "println":
      return PrintlnNodeGen.create(argumentsNode.getArgumentNodes()[0]);
  }
}

Since the Predefined class is completely under our control, we can provide this node implementation for println(.). In the lookup routine, we test whether the class on which the lookup is done is the Predefined class; in case it is, and the desired function is "println", we instantiate a PrintlnNode. Note here that the TruffleDSL generated the implementation with the name PrintlnNodeGen. This class has a factory method create(expr) that takes the argument expression. In our case, we execute this code as part of the UninitializedInvocationNode that has the argument expressions stored in argumentsNode. So, we can take the corresponding expression from there. Once the lookup is done, the created node object is returned to specialize(.), which will eventually replace the invocation node with our new PrintlnNode. This means the print operation is directly inlined into the AST.

Wrap Up and Conclusions

In this post, many basic Truffle concepts have been introduced. From basic AST nodes, through the design of an AST-based interpreter, arithmetic operations, and control flow constructs, to function calls, we covered everything we needed to execute a simple Fibonacci function. For more on the basic mechanisms, the paper on Self-Optimizing AST Interpreters is certainly worth a read.

The code for this post is available on the truffle/fib branch. And with the following steps, it should be possible to execute the fib.golo program:

./gradlew installDist
build/install/golo/bin/golo golo --truffle --files samples/fib.golo

If everything works out, it should print 55 and exit cleanly.

In the next post, we will see the remaining features that are needed to get our Mandelbrot program running. But before we close for today, I wanted to briefly point at the differences between the bytecode generation of the last post and how Truffle interpreters work. From my perspective, the two approaches require quite different mindsets. When generating bytecodes, one has to be very aware of how the Java Virtual Machine works. Specifically, we have to constantly track in our mind how it interacts with the stack, what the effect of the eventually executed bytecode is, and so on. To truly understand it, we need to be in a mindset for compilation: we need to imagine how the JVM executes and then generate instructions for it so that it does what we want.

With a Truffle-style interpreter, on the other hand, we only need to imagine what we want to do, and can then codify it in normal Java. We do not have to picture an external machine that is doing something on our behalf; instead, we just need to think about what needs to be done for a specific piece of the input program. Of course, this only goes so far. With the Graal compiler in the mix, we actually need a pretty good understanding of how the optimizers work, but at least we do not need to mess with a stack machine.

Truffle Concepts, Conventions, and Other Important Bits

  • @TypeSystem (sec. 4), optional, generates helper methods for type tests
  • execute*() methods (sec. 4) implement semantics for an expected return type
  • executeEvaluated(...) methods (sec. 6.2) are a convention used by Truffle to implement a node's semantics taking all subexpressions already fully evaluated.
  • @Specialization (sec. 5.1), annotation of the DSL to generate a node implementing the annotated functionality, i.e., an optimized variant of a generic operation
  • @Children and @Child (sec. 4.1) mark subnodes so that the framework can do correct node replacement in the AST. They are implicitly compilation final.
  • @NodeChildren, @NodeChild (sec. 5.1), annotations of the DSL to add subnodes in generated code
  • abstract get*() methods, for instance getLeft() and getRight(); the DSL generates implementations for them, so that subnodes managed by the DSL can be referenced
  • @ExplodeLoop (sec. 4.1), annotation to ask for unrolling of loop during compilation
  • @CompilationFinal, annotation to mark a field as final for the Graal compiler even though it cannot be marked with the keyword final, for instance because of lazy initialization
  • ControlFlowException (sec. 5.2), the root class for all exceptions that are expressing control flow in a Truffle AST, optimized to avoid runtime overhead
  • nodes can be part of an AST only once, because they have only one parent

Acknowledgements

I'd like to thank Julien Ponge for his comments on drafts of this tutorial and his help with the technical details of Golo. Furthermore, I'd like to thank Andreas Wöß for his comments and clarifications of Truffle-related details.

Creative Commons License
“Add Graal JIT Compilation to Your JVM Language in 5 Easy Steps” by Stefan Marr is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Permissions beyond the scope of this license are available on request.

Add Graal JIT Compilation to Your JVM Language in 5 Easy Steps, Step 2

Step 2: Adding Bit Operations To Golo

As discussed in the previous post, we will eventually try to improve the numerical performance of Golo by adding Graal JIT compilation. However, for the chosen Mandelbrot benchmark we first need to add support for bit operations to Golo. The main reason for doing this exercise first is to get a better understanding of how the current Golo implementation works. It was designed as a dynamic language to exploit the invokedynamic support of the JVM. To be able to do that, Golo uses a bytecode compilation approach. So, instead of directly executing the abstract syntax tree (AST) the parser creates from the source program, the AST is first compiled to bytecodes, which are then executed by the JVM. While this post dives into some of the details, an earlier paper on Golo can provide a few more insights.

Adapting the Parser

Golo's parser is built with JavaCC. Thus, the Golo grammar is specified in some extended variant of BNF. The parser generated by JavaCC takes .golo files and produces ASTs, which are then in multiple steps transformed to bytecodes.

To add bit operations, we need to add new binary operators to the language. Currently, Golo supports neither xor (^), nor left shift (<<), nor or (|). To avoid messing with the language designers' desired aesthetics, we will use the simplest and ugliest solution here, and just hope they won't adopt it but find something that fits the language instead. So, we are simply adding the three keywords bitXOR, bitLSHIFT, and bitOR.

This means, we extend the Golo grammar by adding BIT_OPERATOR to the definition of TOKEN:

TOKEN :
{
  < MULTIPLICATIVE_OPERATOR: ("*" | "/" | "%") >
  |
  < BIT_OPERATOR: ("bitOR" | "bitLSHIFT" | "bitXOR") >
  |
// ...

Now, the parser can recognize these tokens in a program. To parse them as part of an expression, we need to adapt how binary expressions, i.e., expressions with two operands, are parsed. In many languages, bit operators have the highest precedence. Thus, we adapt the rule for MultiplicativeExpression, which currently is the one that then falls back to unary expressions. The new rule for BitExpressions is listed below. It essentially means that the left operand of a BitExpression is parsed as a UnaryExpression, then there might be blank lines, then we need a BIT_OPERATOR, and eventually we expect the right operand, which itself can be some invocation.

void BitExpression() #void:
{
  Token token = null;
}
{
  UnaryExpression()
  (
    LOOKAHEAD(2) (BlankLine())? token=<BIT_OPERATOR> (BlankLine())? InvocationExpression()
    {
      jjtThis.addOperator(token.image);
    }
  )* #BitExpression(token != null)
}

The JavaCC parser uses these rules to produce code that instantiates an ASTBitExpression object, which holds the operator and the child expressions. As we will see in later posts, this structure is already very close to the Truffle ASTs we will use for execution. For basic Golo, however, we first need some more changes to get the bit operations working.

Implementing The Bit Operations

The basic operators are implemented in the OperatorSupport class. We merely add the following three functions:

public static Object bit_lshift(Integer a, Integer b) {
  return a << b;
}

public static Object bit_or(Integer a, Integer b) {
  return a | b;
}

public static Object bit_xor(Integer a, Integer b) {
  return a ^ b;
}

One might notice that all arguments and return values are proper objects. In our case, the arguments are Integer instead of the primitive int. Thus, the values on which basic operations are performed need to be boxed. The main reason for this design is most likely the use of invokedynamic and the benefits of a simple uniform representation.
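With the operators implemented, a hypothetical Golo snippet using the new keywords would look like this:

module samples.Bits

function main = |args| {
  println(1 bitLSHIFT 3)  # prints 8
  println(8 bitOR 1)      # prints 9
  println(9 bitXOR 3)     # prints 10
}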

Using InvokeDynamic To Execute Bit Operations

Before these methods can be invoked, the necessary bytecode needs to be generated. Golo uses an intermediate representation to simplify bytecode generation by ensuring some basic properties of the language. For the bit operations, we needed to add a visit(.) method for ASTBitExpression in the visitor that transforms the JavaCC AST into the Golo intermediate representation. For brevity, we skip over the details, which can be seen in the corresponding patch.

For the bytecode generation, we can have a look at the JavaBytecodeGenerationGoloIrVisitor class. The following method handles all binary operations, which includes our bit operators:

void genericBinaryOperator(BinaryOperation binaryOperation, OperatorType operatorType) {
  binaryOperation.getLeftExpression().accept(this);
  binaryOperation.getRightExpression().accept(this);
  if (!isMethodCall(binaryOperation)) {
    String name = operatorType.name().toLowerCase();
    methodVisitor.visitInvokeDynamicInsn(
      name, goloFunctionSignature(2), OPERATOR_HANDLE, (Integer) 2);
  }
}

While this is a small and simple method, it gives a bit of a feeling for how the bytecode generation works. The first thing the method does is take the left expression and generate bytecode for it. Afterwards, it generates the bytecode for the right expression. To fully understand what this means, we need to recall that the JVM is a stack machine. So, after the bytecode that belongs to the left expression has executed, its result remains on the stack. This is not obvious here; it is only implied by the JVM's execution semantics. Similarly, the right-hand side will also be evaluated to a value on the stack. In the simplest case, this means there are now two values on the stack.

For our bit operations, the next step is crucial. It takes the operator type and converts its name to lower case, which then corresponds to our bit_xor, bit_lshift, or bit_or methods. Afterwards, it creates an invokedynamic call to these methods. The call consumes the two values on the stack that the left and right expressions produced, and pushes a new result value onto the stack. The code of the bytecode generation hides many of these details, but I hope this discussion gives an impression of what happens during compilation as well as later during execution.
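Putting this together, for an expression like a bitXOR b the generated code conceptually corresponds to the following sequence (a sketch; the actual instructions depend on the operand expressions):

<bytecode for a>      leaves the value of a on the stack
<bytecode for b>      leaves the value of b on the stack
invokedynamic bit_xor(Object,Object)Object
                      pops both operands, pushes the boxed result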

Conclusion

To summarize, Golo uses a three-step approach to bytecode generation. In the first step, the JavaCC-based parser generates an AST. This AST is then transformed into a more convenient intermediate representation, which is, in a third step, compiled to bytecodes.

As we could see above, Golo relies on invokedynamic and uses a uniform Object-based representation for all values.

In the next post, we will finally start with the basics of how to apply Graal and Truffle to Golo, and implement a sufficient set of the language to run a classic Fibonacci computation.

Acknowledgements

I'd like to thank Julien Ponge for his comments on drafts of this tutorial and his help with the technical details of Golo.
