Building Self-Optimizing Interpreters with Truffle or RPython

A Hands-On Tutorial

Stefan Marr
Berne, Switzerland, December 2, 2014

Who Knows Smalltalk?

Hands up, please!

The Language

A minimal Smalltalk
for teaching of and research on
Virtual Machines.

(Not a Smalltalk-80)

The SOM Family

See som-st.github.io for the history.

A Smalltalk with File Syntax

Object = nil (
    class       = primitive

    =  other    = ( ^self == other )
    <> other    = ( ^(self = other) not )
    == other    = primitive
    ~= other    = (^ (self == other) not )
    isNil       = ( ^false )
    notNil      = ( ^true )

    "Evaluating"
    value       = ( ^self )
    
    "..."
)

Control Structures use polymorphism, closures, and non-local returns

Boolean = (
    ifTrue: trueBlock ifFalse: falseBlock = (
        self ifTrue:  [ ^trueBlock value  ].
        self ifFalse: [ ^falseBlock value ].
    )
)

True = Boolean (
    ifTrue:  block = ( ^block value )
    ifFalse: block = ( ^nil )
    "..."
)

False = Boolean (
    ifTrue:  block = ( ^nil )
    ifFalse: block = ( ^block value )
    "..."
)

Classes, `true`, `nil`, `false`,
and a few others are globals

System = (
  global: name            = primitive
  global: name put: value = primitive
  "..."
)

"and can be changed"
system global: #true   put: false.     "I wouldn't recommend that..."
system global: #Object put: nil.       "And this one neither..."

And, of course, there is JsSOM

SOM Shell.

--->

General Layout of a SOM Implementation

|
+- Examples       # A Hello World, and Benchmarks
+- Smalltalk      # The core classes
+- TestSuite      # A Test harness and unit tests
|
+- .travis.yml    # A Travis CI setup, detailing all dependencies
+- Makefile       # A generic Makefile, so that `make` works for all SOMs
+- src            # The interpreter implementation
+- som.sh         # A standard wrapper to start the interpreter

TruffleSOM

A SOM based on the Truffle Framework

Further Resources on Truffle and the Graal JIT Compiler

Oracle Labs: Graal VM and Truffle/JS
Truffle/JS is a JavaScript implemented with Truffle.
OpenJDK: Truffle FAQ and Guidelines
JKU Linz: The Truffle Language Implementation Framework
Incl. a list of Truffle-related publications.
JKU Linz: Graal and Truffle Builds

Switch to Eclipse

A deep dive into TruffleSOM

Package Structure

som.compiler - classic recursive decent parser
som.interpreter - interpreter implementation
som.interpreter.nodes.* - AST nodes and specializations
som.interpreter.objectstorage - object field type specializations
som.primitives - language primitives as AST nodes
som.vm.Universe - bootstrapping and object model
som.vm.Shell - a simple REPL
som.vmobjects - VM representation of SOM objects

Object Representation

Design Principles:
- Avoid Wrapping Objects
  - avoid unnecessary code/classes, and allocation
  - implicit primitive boxing makes it easy
- Simplicity and Uniformity (everything is a Java object)
  - Show ExpressionNode
  - SOM's Boolean maps on boolean
  - Integer maps on long and BigInteger
  - Double maps on double
  - Show AdditionPrim

TruffleSOM's Object Storage Model

Custom implementation
Few days ago, the framework got finally direct support:
An Object Storage Model for the Truffle Language Implementation Framework.
Show SObject

Has separate fields for primitives and objects

and extension arrays

Runtime object usage recorded per class (objectstorage)

Each object has layout, to allow lazy updates

#become: can be implemented easily

Primitive Implementation

Smalltalk: everything is a message send, incl. #+, #-,...
Integer only lists primitives
Show Integer
Primitives built from AST nodes
Show Primitives.constructPrimitive()
These are normal methods, found by message send
For AST-level specialization: eager primitive replacement
Show AbstractUninitializedMessageSendNode.specialize()
avoids message send overhead, enables better optimization

Truffle Framework

Frame classes
DSL
Assumptions
...

Hands-On Task

Optimize Pointer Inequality Test (`Object>>#~=`)

Write microbenchmark to measure the benefit
In Examples/Benchmarks/PointerInequality.som
Based on Loop.som: nil ~= true ifTrue: [ sum := sum + 1]
Execute benchmark
Create primitive node for inequality, based on EqualsEqualsPrim
Perhaps called NotEqualsEqualsPrim
Register primitive in ObjectPrimitives, next to "=="
Measure impact on interpreter speed
Add #~= to AbstractUninitializedMessageSendNode.specializeBinary() to avoid primitive method call.
Remeasure
Expected speedup >30%

Benchmarks

Benchmarks are subclasses of Benchmark
Examples/Benchmarks/BenchmarkHarness.som provides harness for execution
Harness takes three parameters
1. number of measurements
2. number of warmup iterations
3. problem size or inner iteration count
Execute benchmarks with interpreter:
./som.sh -cp Smalltalk Examples/Benchmarks/BenchmarkHarness.som PointerInequality 100 0 100

PointerInequality = Benchmark (

    singleRun = (
        | sum |
        sum := 0.
        1 to: 100 do: [ :j |
          nil ~= true ifTrue: [ sum := sum + 1] ].
        (sum = 100)
            ifFalse: [
                self error: 'Wrong result: ' + sum + ' should be: 100' ].
        ^ sum
    )

    benchmark = ( 
        | sum |
        sum := 0.
        1 to: 200 do: [ :i | sum := sum + self singleRun ].
        (sum = 20000) 
            ifFalse: [
                self error: 'Wrong result: ' + sum + ' should be: 20000' ]
    )
  )

package som.primitives;

import java.math.BigInteger;

import som.interpreter.nodes.nary.BinaryExpressionNode;
import som.vm.constants.Globals;
import som.vmobjects.SBlock;
import som.vmobjects.SInvokable;
import som.vmobjects.SObject;
import som.vmobjects.SSymbol;

import com.oracle.truffle.api.dsl.Specialization;

public abstract class NotEqualsEqualsPrim extends BinaryExpressionNode {

  @Specialization
  public final boolean doBoolean(final boolean left, final boolean right) {
    return left != right;
  }

  @Specialization
  public final boolean doBoolean(final boolean left, final SObject right) {
    return (!left && Globals.trueObject  == right) ||
            (left && Globals.falseObject == right);
  }

  @Specialization
  public final boolean doLong(final long left, final long right) {
    return left != right;
  }

  @Specialization
  public final boolean doBigInteger(final BigInteger left, final BigInteger right) {
    return left != right;
  }

  @Specialization
  public final boolean doString(final String left, final String right) {
    return left != right;
  }

  @Specialization
  public final boolean doDouble(final double left, final double right) {
    return left != right;
  }

  @Specialization
  public final boolean doSBlock(final SBlock left, final Object right) {
    return left != right;
  }

  @Specialization
  public final boolean doArray(final Object[] left, final Object right) {
    return left != right;
  }

  @Specialization
  public final boolean doSMethod(final SInvokable left, final Object right) {
    return left != right;
  }

  @Specialization
  public final boolean doSSymbol(final SSymbol left, final Object right) {
    return left != right;
  }

  @Specialization
  public final boolean doSObject(final SObject left, final Object right) {
    return left != right;
  }

  @Specialization
  public final boolean doLong(final long left, final double right) {
    return true;
  }

  @Specialization
  public final boolean doBigInteger(final BigInteger left, final long right) {
    return true;
  }

  @Specialization
  public final boolean doLong(final long left, final BigInteger right) {
    return true;
  }

  @Specialization
  public final boolean doDouble(final double left, final long right) {
    return true;
  }

  @Specialization
  public final boolean doLong(final long left, final String right) {
    return true;
  }

  @Specialization
  public final boolean doLong(final long left, final SObject right) {
    return true;
  }

  @Specialization
  public final boolean doString(final String receiver, final long argument) {
    return true;
  }

  @Specialization
  public final boolean doString(final String receiver, final SObject argument) {
    return true;
  }
}

Self-Optimization at Work (1/3)

Start IGV (Ideal Graph Visualizer)
~/-/graal$ ./mx.sh igv &
Save and revert local changes with git stash
Run ant to compile everything
Run PointerInequality microbenchmark with ./fast-igv.sh instead of ./som.sh
Investigate Method PointerInequality>>#$blockMethod in IGV
- Basic intro to Graph viewer
- Node properties, OptimizedDirectCall see callTarget

Self-Optimization at Work (2/3)

Investigate Optimization process
- Open last TruffleCompilerThread
- Brief overview of the different phases
- Typical compiler phases
- But, some specific to speculative optimization in JIT compilation
- Inlining on AST level
- Node exploration with focusing on particular nodes, hiding non-relevant stuff
Making sense of large graphs
- Node search function
- Try to find known node: search for known constants/literal values
- Focus on relevant nodes
- "Time line" to investigate impact of optimizations

Self-Optimization at Work (3/3)

Reapply changes and retry
- git stash pop
ant
rerun with ./fast-igv.sh

Hands-On Task: Control Structures

Add a `Block>>#repeatUntil:`

Implement the library version in Smalltalk/Block.som
Execute receiver once, and then use #whileFalse:
Can be tested, e.g., with extra code in Examples/Hello.som
Write microbenchmark to measure the benefit
Based on WhileLoop
Measure impact on interpreter speed
Create specialize node
Inspirations from: WhileWithStaticBlocksNode
Add #repeatUntil: to AbstractUninitializedMessageSendNode.specializeBinary() to avoid primitive method call.
Measure impact on interpreter speed
Why don't we see real benefit in interpreter?
Measure impact on compiled speed

Block = (
    "..."

    repeatUntil = (
      repeatUntil: aConditionBlock = (
        self value.
        aConditionBlock whileFalse: self
    )
)

Hello = (

    "The 'run' method is called when initializing the system"
    run = (
      | i |
      'Hello, World from SOM' println.
      
      i := 0.
      [ i := i + 1.
        i println.
      ] repeatUntil: [i = 10] )
    
)

RepeatUntilLoop = Benchmark (

    singleRun = (
        | sum |
        sum := 0.        
        [sum := sum + 1]
          repeatUntil:
            [sum = 1000]
        ^ sum
    )

    benchmark = ( 
        | sum |
        sum := 0.
        [sum := sum + self singleRun]
          repeatUntil:
            [sum = 20000].
        ^ sum
    )    
)

package som.interpreter.nodes.specialized.whileloops;

import som.interpreter.Invokable;
import som.interpreter.nodes.literals.BlockNode;
import som.interpreter.nodes.nary.BinaryExpressionNode;
import som.vm.constants.Nil;
import som.vmobjects.SBlock;

import com.oracle.truffle.api.CallTarget;
import com.oracle.truffle.api.CompilerAsserts;
import com.oracle.truffle.api.CompilerDirectives;
import com.oracle.truffle.api.Truffle;
import com.oracle.truffle.api.frame.VirtualFrame;
import com.oracle.truffle.api.nodes.DirectCallNode;
import com.oracle.truffle.api.nodes.Node;
import com.oracle.truffle.api.nodes.RootNode;
import com.oracle.truffle.api.source.SourceSection;


public class RepeatUntilNode extends BinaryExpressionNode {
  @Child protected BlockNode receiver;
  @Child protected BlockNode argument;

  @Child protected DirectCallNode conditionValueSend;
  @Child protected DirectCallNode bodyValueSend;

  public RepeatUntilNode(final BlockNode rcvrNode, final BlockNode argNode,
      final SBlock rcvr, final SBlock arg,
      final SourceSection source) {
    super(source);

    receiver = rcvrNode;
    CallTarget callTargetBody = rcvr.getMethod().getCallTarget();
    bodyValueSend = Truffle.getRuntime().createDirectCallNode(
        callTargetBody);

    argument = argNode;
    CallTarget callTargetCondition = arg.getMethod().getCallTarget();
    conditionValueSend = Truffle.getRuntime().createDirectCallNode(
        callTargetCondition);
  }

  @Override
  public final Object executeEvaluated(final VirtualFrame frame,
      final Object rcvr, final Object arg) {
    doLoop(frame, (SBlock) rcvr, (SBlock) arg);
    return Nil.nilObject;
  }

  private void doLoop(final VirtualFrame frame,
      final SBlock loopBody, final SBlock loopCondition) {
    long iterationCount = 0;

    try {
      boolean loopConditionResult;
      do  {
        bodyValueSend.call(frame, new Object[] {loopBody});
        loopConditionResult = (boolean) conditionValueSend.call(
            frame, new Object[] {loopCondition});

        if (CompilerDirectives.inInterpreter()) { iterationCount++; }
      } while (!loopConditionResult);
    } finally {
      if (CompilerDirectives.inInterpreter()) {
        reportLoopCount(iterationCount);
      }
    }
  }

  protected final void reportLoopCount(final long count) {
    CompilerAsserts.neverPartOfCompilation("reportLoopCount");
    Node current = getParent();
    while (current != null && !(current instanceof RootNode)) {
      current = current.getParent();
    }
    if (current != null) {
      ((Invokable) current).propagateLoopCountThroughoutLexicalScope(count);
    }
  }

  @Override
  public Object executeGeneric(final VirtualFrame frame) {
    SBlock rcvr = receiver.executeSBlock(frame);
    SBlock arg  = argument.executeSBlock(frame);
    return executeEvaluated(frame, rcvr, arg);
  }
}

What's Broken?

#repeatUntil: and dynamic blocks?
For optimal warmup behavior: communicate loop counters to compiler
Slows down interpreter

Anything else broken?

Pointer inequality?
Impact on language features?
Inheritance

In SOM: everything that's not tested is broken.
Everything that is not tested, is used for optimization!

RTruffleSOM

A SOM based on the Truffle-ideas and RPython

Further Resources on Meta-Tracing and RPython

RPython-related Papers
RPython documentation
JKU Linz: The Truffle Language Implementation Framework
Incl. a list of Truffle-related publications.
JKU Linz: Graal and Truffle Builds

Switch to PyCharm

A deep dive into RTruffleSOM

Package Structure - Based on TruffleSOM

som.compiler - classic recursive decent parser
som.interpreter - interpreter implementation
som.interpreter.nodes.* - AST nodes and specializations
som.interpreter.objectstorage - object field type specializations
som.primitives - language primitives as AST nodes
som.vm.Universe - bootstrapping and object model
som.vm.Shell - a simple REPL
som.vmobjects - VM representation of SOM objects
rtruffle - minimal framework for node specialization

Object Representation

Design Principles:
- Simplicity and Uniformity (everything is an AbstractObject)
  - RPython's type system, and absence of TruffleDSL make wrapping simpler
  - Show ExpressionNode
  - Show AbstractObject (ctrl+h)
  - All objects wrapped.

RTruffleSOM's Object Storage Model

Minimal, no primitive specialization
Show Object
Has direct fields and extension array (like TruffleSOM)
#become: can be implemented easily

Primitive Implementation

Smalltalk: everything is a message send, incl. #+, #-,...
Integer only lists primitives
Show Integer
Primitives implemented directly, no AST nodes
because it is easier without TruffleDSL
Show _asString()
These are normal methods, found by message send
No eager primitive replacement
Always send and then primitive activation

Other Differences

Generally, no type-based specializations
Focus on control structures, and library lowering
Need to compile VM
But, it is 'just' Python. RPython is strict subset of Python.

Testing doesn't require compilation:

../pypy/pytest.py tests/basic_interpreter_test.py
../pypy/pytest.py tests/

RPython

virtualizables
quasi immutables
...

Hands-On Task: `#~=` and `#repeatUntil:`

Implement Both Optimizations

Same approach, but some differences:

Inequality primitives goes into object_primitives.py, based on _equals(.,.,.)
#repeatUntil: should be based on while_node.py
Note: booleans are boxed, compare to self._universe.trueObject
You'll need to copy your changes from the SOM core-lib to RTruffleSOM's version
When benchmarking with ./som.sh, remember this is running on top of PyPy or CPython, and not as a standalone interpreter
Compile the interpreter with ./translate.sh interp

def _unequals(ivkbl, rcvr, args):
    op1 = args[0]
    op2 = rcvr
    if op1 is not op2:
        return ivkbl.get_universe().trueObject
    else:
        return ivkbl.get_universe().falseObject

from rpython.rlib import jit

from ..expression_node import ExpressionNode

from ....vmobjects.block  import Block
from ....vmobjects.method import Method


def get_printable_location(body_method, condition_method, node):
    assert isinstance(condition_method, Method)
    assert isinstance(body_method, Method)

    return "%s repeatUntil: %s" % (condition_method.merge_point_string(),
                                   body_method.merge_point_string())


repeat_driver = jit.JitDriver(
    greens=['body_method', 'condition_method', 'node'], reds='auto',
    get_printable_location = get_printable_location)


class RepeatUntilNode(ExpressionNode):

    _immutable_fields_ = ['_rcvr_expr?', '_body_expr?',
                          '_universe']
    _child_nodes_      = ['_rcvr_expr', '_body_expr']

    def __init__(self, rcvr_expr, body_expr, universe, source_section):
        ExpressionNode.__init__(self, source_section)
        self._rcvr_expr      = self.adopt_child(rcvr_expr)
        self._body_expr      = self.adopt_child(body_expr)
        self._universe       = universe

    def execute(self, frame):
        rcvr_value = self._rcvr_expr.execute(frame)
        body_block = self._body_expr.execute(frame)

        self._do_loop(rcvr_value, body_block)
        return self._universe.nilObject

    def execute_evaluated(self, frame, rcvr, args):
        self._do_loop(rcvr, args[0])
        return self._universe.nilObject

    def _do_loop(self, rcvr_block, body_block):
        body_method      = rcvr_block.get_method()
        condition_method = body_block.get_method()

        if rcvr_block.is_same_context(body_block):
            rcvr_block = body_block

        while True:
            repeat_driver.jit_merge_point(body_method     = body_method,
                                          condition_method= condition_method,
                                          node            = self)

            # STEFAN: looks stupid but might help the jit
            if rcvr_block is body_block:
                rcvr_block = body_block

            body_method.invoke(body_block, [])
            condition_value = condition_method.invoke(rcvr_block, [])
            if condition_value is self._universe.trueObject:
                break

    @staticmethod
    def can_specialize(selector, rcvr, args, node):
        sel = selector.get_string()
        return isinstance(rcvr, Block) and \
               isinstance(args[0], Block) and \
               (sel == "repeatUntil:")

    @staticmethod
    def specialize_node(selector, rcvr, args, node):
        return node.replace(
            RepeatUntilNode(node._rcvr_expr, node._arg_exprs[0],
                            node._universe, node._source_section))

Exploring the RPython Tooling

PyGame Trace Viewer

Open tests/jit.py

Add


    def test_unequals(self):
        cp = (py.path.local(__file__).dirpath().dirpath().join(
              "Smalltalk").strpath + ":" +
              py.path.local(__file__).dirpath().dirpath().join(
              "Examples/Benchmarks").strpath)
        self._eval_expr("""PointerInequality new benchmark""", cp)

execute ../pypy/pytest.py tests/jit.py -ktest_unequals -s
Ignore the first trace (hit ESC)
Second trace has a back-pointer, indicating the loop
Only increments and guards on counters are in the loop
Disabling the #~= optimization will result in more or less identical trace, with slightly different debug info, indicating the different SOM message sends
Note: traces are generated with different thresholds, usually resulting in more traces than in compiled version

Exploring the RPython Tooling

Traces from Compiled Version

Start compilation: ./translate.sh
Take a break (it takes 7min on my machine)
To get production traces: PYPYLOG=jit-log-opt,jit-backend:unequals.log ./RTruffleSOM-jit -cp Smalltalk Examples/Benchmarks/BenchmarkHarness.som PointerInequality 50 0 100
To get a visualization and the native code: ../pypy/rpython/jit/backend/tool/viewcode.py unequals.log
More tools in ../pypy/rpython/jit/tool/ but currently seemingly broken
To inspect traces also try jitviewer