Building Self-Optimizing Interpreters with Truffle or RPython

A Hands-On Tutorial

Stefan Marr
Berne, Switzerland, December 2, 2014

Who Knows Smalltalk?

Smalltalk logo

Hands up, please!

The Language

A minimal Smalltalk
for teaching of and research on
Virtual Machines.

(Not a Smalltalk-80)

The SOM Family

 

See som-st.github.io for the history.

A Smalltalk with File Syntax

Object = nil (
    class       = primitive

    =  other    = ( ^self == other )
    <> other    = ( ^(self = other) not )
    == other    = primitive
    ~= other    = (^ (self == other) not )
    isNil       = ( ^false )
    notNil      = ( ^true )

    "Evaluating"
    value       = ( ^self )
    
    "..."
)

Control Structures use polymorphism, closures, and non-local returns

Boolean = (
    ifTrue: trueBlock ifFalse: falseBlock = (
        self ifTrue:  [ ^trueBlock value  ].
        self ifFalse: [ ^falseBlock value ].
    )
)

True = Boolean (
    ifTrue:  block = ( ^block value )
    ifFalse: block = ( ^nil )
    "..."
)

False = Boolean (
    ifTrue:  block = ( ^nil )
    ifFalse: block = ( ^block value )
    "..."
)

Classes, true, nil, false,
and a few others are globals

System = (
  global: name            = primitive
  global: name put: value = primitive
  "..."
)

"and can be changed"
system global: #true   put: false.     "I wouldn't recommend that..."
system global: #Object put: nil.       "And this one neither..."

And, of course, there is JsSOM

SOM Shell.
--->

General Layout of a SOM Implementation

|
+- Examples       # A Hello World, and Benchmarks
+- Smalltalk      # The core classes
+- TestSuite      # A Test harness and unit tests
|
+- .travis.yml    # A Travis CI setup, detailing all dependencies
+- Makefile       # A generic Makefile, so that `make` works for all SOMs
+- src            # The interpreter implementation
+- som.sh         # A standard wrapper to start the interpreter

TruffleSOM

A SOM based on the Truffle Framework

Further Resources on Truffle and the Graal JIT Compiler

Switch to Eclipse

A deep dive into TruffleSOM

Package Structure

  • som.compiler - classic recursive decent parser
  • som.interpreter - interpreter implementation
  • som.interpreter.nodes.* - AST nodes and specializations
  • som.interpreter.objectstorage - object field type specializations
  • som.primitives - language primitives as AST nodes
  • som.vm.Universe - bootstrapping and object model
  • som.vm.Shell - a simple REPL
  • som.vmobjects - VM representation of SOM objects

Object Representation

  • Design Principles:
    • Avoid Wrapping Objects
      • avoid unnecessary code/classes, and allocation
      • implicit primitive boxing makes it easy
    • Simplicity and Uniformity (everything is a Java object)
      • Show ExpressionNode
      • SOM's Boolean maps on boolean
      • Integer maps on long and BigInteger
      • Double maps on double
      • Show AdditionPrim
      • Array maps on Object[]
      • Show SArray

TruffleSOM's Object Storage Model

Primitive Implementation

Truffle Framework

  • Frame classes
  • DSL
  • Assumptions
  • ...

Hands-On Task

Optimize Pointer Inequality Test (Object>>#~=)

  1. Write microbenchmark to measure the benefit
    In Examples/Benchmarks/PointerInequality.som
    Based on Loop.som: nil ~= true ifTrue: [ sum := sum + 1]
  2. Execute benchmark
  3. Create primitive node for inequality, based on EqualsEqualsPrim
    Perhaps called NotEqualsEqualsPrim
  4. Register primitive in ObjectPrimitives, next to "=="
  5. Measure impact on interpreter speed
  6. Add #~= to AbstractUninitializedMessageSendNode.specializeBinary() to avoid primitive method call.
  7. Remeasure
    Expected speedup >30%

Benchmarks

  • Benchmarks are subclasses of Benchmark
  • Examples/Benchmarks/BenchmarkHarness.som provides harness for execution
  • Harness takes three parameters
    1. number of measurements
    2. number of warmup iterations
    3. problem size or inner iteration count
  • Execute benchmarks with interpreter:
    ./som.sh -cp Smalltalk Examples/Benchmarks/BenchmarkHarness.som PointerInequality 100 0 100
PointerInequality = Benchmark (

    singleRun = (
        | sum |
        sum := 0.
        1 to: 100 do: [ :j |
          nil ~= true ifTrue: [ sum := sum + 1] ].
        (sum = 100)
            ifFalse: [
                self error: 'Wrong result: ' + sum + ' should be: 100' ].
        ^ sum
    )

    benchmark = ( 
        | sum |
        sum := 0.
        1 to: 200 do: [ :i | sum := sum + self singleRun ].
        (sum = 20000) 
            ifFalse: [
                self error: 'Wrong result: ' + sum + ' should be: 20000' ]
    )
  )
package som.primitives;

import java.math.BigInteger;

import som.interpreter.nodes.nary.BinaryExpressionNode;
import som.vm.constants.Globals;
import som.vmobjects.SBlock;
import som.vmobjects.SInvokable;
import som.vmobjects.SObject;
import som.vmobjects.SSymbol;

import com.oracle.truffle.api.dsl.Specialization;

public abstract class NotEqualsEqualsPrim extends BinaryExpressionNode {

  @Specialization
  public final boolean doBoolean(final boolean left, final boolean right) {
    return left != right;
  }

  @Specialization
  public final boolean doBoolean(final boolean left, final SObject right) {
    return (!left && Globals.trueObject  == right) ||
            (left && Globals.falseObject == right);
  }

  @Specialization
  public final boolean doLong(final long left, final long right) {
    return left != right;
  }

  @Specialization
  public final boolean doBigInteger(final BigInteger left, final BigInteger right) {
    return left != right;
  }

  @Specialization
  public final boolean doString(final String left, final String right) {
    return left != right;
  }

  @Specialization
  public final boolean doDouble(final double left, final double right) {
    return left != right;
  }

  @Specialization
  public final boolean doSBlock(final SBlock left, final Object right) {
    return left != right;
  }

  @Specialization
  public final boolean doArray(final Object[] left, final Object right) {
    return left != right;
  }

  @Specialization
  public final boolean doSMethod(final SInvokable left, final Object right) {
    return left != right;
  }

  @Specialization
  public final boolean doSSymbol(final SSymbol left, final Object right) {
    return left != right;
  }

  @Specialization
  public final boolean doSObject(final SObject left, final Object right) {
    return left != right;
  }

  @Specialization
  public final boolean doLong(final long left, final double right) {
    return true;
  }

  @Specialization
  public final boolean doBigInteger(final BigInteger left, final long right) {
    return true;
  }

  @Specialization
  public final boolean doLong(final long left, final BigInteger right) {
    return true;
  }

  @Specialization
  public final boolean doDouble(final double left, final long right) {
    return true;
  }

  @Specialization
  public final boolean doLong(final long left, final String right) {
    return true;
  }

  @Specialization
  public final boolean doLong(final long left, final SObject right) {
    return true;
  }

  @Specialization
  public final boolean doString(final String receiver, final long argument) {
    return true;
  }

  @Specialization
  public final boolean doString(final String receiver, final SObject argument) {
    return true;
  }
}

Self-Optimization at Work (1/3)

  1. Start IGV (Ideal Graph Visualizer)
    ~/-/graal$ ./mx.sh igv &
  2. Save and revert local changes with git stash
  3. Run ant to compile everything
  4. Run PointerInequality microbenchmark with ./fast-igv.sh instead of ./som.sh
  5. Investigate Method PointerInequality>>#$blockMethod in IGV
    • Basic intro to Graph viewer
    • Node properties, OptimizedDirectCall see callTarget

Self-Optimization at Work (2/3)

  1. Investigate Optimization process
    • Open last TruffleCompilerThread
    • Brief overview of the different phases
    • Typical compiler phases
    • But, some specific to speculative optimization in JIT compilation
    • Inlining on AST level
    • Node exploration with focusing on particular nodes, hiding non-relevant stuff
  2. Making sense of large graphs
    • Node search function
    • Try to find known node: search for known constants/literal values
    • Focus on relevant nodes
    • "Time line" to investigate impact of optimizations

Self-Optimization at Work (3/3)

  1. Reapply changes and retry
    • git stash pop
  2. ant
  3. rerun with ./fast-igv.sh

Hands-On Task: Control Structures

Add a Block>>#repeatUntil:

  1. Implement the library version in Smalltalk/Block.som
    Execute receiver once, and then use #whileFalse:
  2. Can be tested, e.g., with extra code in Examples/Hello.som
  3. Write microbenchmark to measure the benefit
    Based on WhileLoop
  4. Measure impact on interpreter speed
  5. Create specialize node
    Inspirations from: WhileWithStaticBlocksNode
  6. Add #repeatUntil: to AbstractUninitializedMessageSendNode.specializeBinary() to avoid primitive method call.
  7. Measure impact on interpreter speed
  8. Why don't we see real benefit in interpreter?
  9. Measure impact on compiled speed
Block = (
    "..."

    repeatUntil = (
      repeatUntil: aConditionBlock = (
        self value.
        aConditionBlock whileFalse: self
    )
)

Hello = (

    "The 'run' method is called when initializing the system"
    run = (
      | i |
      'Hello, World from SOM' println.
      
      i := 0.
      [ i := i + 1.
        i println.
      ] repeatUntil: [i = 10] )
    
)
RepeatUntilLoop = Benchmark (

    singleRun = (
        | sum |
        sum := 0.        
        [sum := sum + 1]
          repeatUntil:
            [sum = 1000]
        ^ sum
    )

    benchmark = ( 
        | sum |
        sum := 0.
        [sum := sum + self singleRun]
          repeatUntil:
            [sum = 20000].
        ^ sum
    )    
)
package som.interpreter.nodes.specialized.whileloops;

import som.interpreter.Invokable;
import som.interpreter.nodes.literals.BlockNode;
import som.interpreter.nodes.nary.BinaryExpressionNode;
import som.vm.constants.Nil;
import som.vmobjects.SBlock;

import com.oracle.truffle.api.CallTarget;
import com.oracle.truffle.api.CompilerAsserts;
import com.oracle.truffle.api.CompilerDirectives;
import com.oracle.truffle.api.Truffle;
import com.oracle.truffle.api.frame.VirtualFrame;
import com.oracle.truffle.api.nodes.DirectCallNode;
import com.oracle.truffle.api.nodes.Node;
import com.oracle.truffle.api.nodes.RootNode;
import com.oracle.truffle.api.source.SourceSection;


public class RepeatUntilNode extends BinaryExpressionNode {
  @Child protected BlockNode receiver;
  @Child protected BlockNode argument;

  @Child protected DirectCallNode conditionValueSend;
  @Child protected DirectCallNode bodyValueSend;

  public RepeatUntilNode(final BlockNode rcvrNode, final BlockNode argNode,
      final SBlock rcvr, final SBlock arg,
      final SourceSection source) {
    super(source);

    receiver = rcvrNode;
    CallTarget callTargetBody = rcvr.getMethod().getCallTarget();
    bodyValueSend = Truffle.getRuntime().createDirectCallNode(
        callTargetBody);

    argument = argNode;
    CallTarget callTargetCondition = arg.getMethod().getCallTarget();
    conditionValueSend = Truffle.getRuntime().createDirectCallNode(
        callTargetCondition);
  }

  @Override
  public final Object executeEvaluated(final VirtualFrame frame,
      final Object rcvr, final Object arg) {
    doLoop(frame, (SBlock) rcvr, (SBlock) arg);
    return Nil.nilObject;
  }

  private void doLoop(final VirtualFrame frame,
      final SBlock loopBody, final SBlock loopCondition) {
    long iterationCount = 0;

    try {
      boolean loopConditionResult;
      do  {
        bodyValueSend.call(frame, new Object[] {loopBody});
        loopConditionResult = (boolean) conditionValueSend.call(
            frame, new Object[] {loopCondition});

        if (CompilerDirectives.inInterpreter()) { iterationCount++; }
      } while (!loopConditionResult);
    } finally {
      if (CompilerDirectives.inInterpreter()) {
        reportLoopCount(iterationCount);
      }
    }
  }

  protected final void reportLoopCount(final long count) {
    CompilerAsserts.neverPartOfCompilation("reportLoopCount");
    Node current = getParent();
    while (current != null && !(current instanceof RootNode)) {
      current = current.getParent();
    }
    if (current != null) {
      ((Invokable) current).propagateLoopCountThroughoutLexicalScope(count);
    }
  }

  @Override
  public Object executeGeneric(final VirtualFrame frame) {
    SBlock rcvr = receiver.executeSBlock(frame);
    SBlock arg  = argument.executeSBlock(frame);
    return executeEvaluated(frame, rcvr, arg);
  }
}

What's Broken?

  • #repeatUntil: and dynamic blocks?
  • For optimal warmup behavior: communicate loop counters to compiler
  • Slows down interpreter


Anything else broken?

  • Pointer inequality?
  • Impact on language features?
  • Inheritance


In SOM: everything that's not tested is broken.
Everything that is not tested, is used for optimization!

RTruffleSOM

A SOM based on the Truffle-ideas and RPython

Further Resources on Meta-Tracing and RPython

Switch to PyCharm

A deep dive into RTruffleSOM

Package Structure - Based on TruffleSOM

  • som.compiler - classic recursive decent parser
  • som.interpreter - interpreter implementation
  • som.interpreter.nodes.* - AST nodes and specializations
  • som.interpreter.objectstorage - object field type specializations
  • som.primitives - language primitives as AST nodes
  • som.vm.Universe - bootstrapping and object model
  • som.vm.Shell - a simple REPL
  • som.vmobjects - VM representation of SOM objects
     
  • rtruffle - minimal framework for node specialization

Object Representation

  • Design Principles:
    • Simplicity and Uniformity (everything is an AbstractObject)
      • RPython's type system, and absence of TruffleDSL make wrapping simpler
      • Show ExpressionNode
      • Show AbstractObject (ctrl+h)
      • All objects wrapped.

RTruffleSOM's Object Storage Model

  • Minimal, no primitive specialization
  • Show Object
  • Has direct fields and extension array (like TruffleSOM)
  • #become: can be implemented easily

Primitive Implementation

  • Smalltalk: everything is a message send, incl. #+, #-,...
  • Integer only lists primitives
  • Show Integer
     
  • Primitives implemented directly, no AST nodes
  • because it is easier without TruffleDSL
  • Show _asString()
  • These are normal methods, found by message send
     
  • No eager primitive replacement
  • Always send and then primitive activation

Other Differences

  • Generally, no type-based specializations
  • Focus on control structures, and library lowering
     
  • Need to compile VM
  • But, it is 'just' Python. RPython is strict subset of Python.
  • Testing doesn't require compilation:
    ../pypy/pytest.py tests/basic_interpreter_test.py
    ../pypy/pytest.py tests/

RPython

  • virtualizables
  • quasi immutables
  • ...

Hands-On Task: #~= and #repeatUntil:

Implement Both Optimizations

Same approach, but some differences:

  • Inequality primitives goes into object_primitives.py, based on _equals(.,.,.)
  • #repeatUntil: should be based on while_node.py
    Note: booleans are boxed, compare to self._universe.trueObject
  • You'll need to copy your changes from the SOM core-lib to RTruffleSOM's version
  • When benchmarking with ./som.sh, remember this is running on top of PyPy or CPython, and not as a standalone interpreter
  • Compile the interpreter with ./translate.sh interp
def _unequals(ivkbl, rcvr, args):
    op1 = args[0]
    op2 = rcvr
    if op1 is not op2:
        return ivkbl.get_universe().trueObject
    else:
        return ivkbl.get_universe().falseObject
    
from rpython.rlib import jit

from ..expression_node import ExpressionNode

from ....vmobjects.block  import Block
from ....vmobjects.method import Method


def get_printable_location(body_method, condition_method, node):
    assert isinstance(condition_method, Method)
    assert isinstance(body_method, Method)

    return "%s repeatUntil: %s" % (condition_method.merge_point_string(),
                                   body_method.merge_point_string())


repeat_driver = jit.JitDriver(
    greens=['body_method', 'condition_method', 'node'], reds='auto',
    get_printable_location = get_printable_location)


class RepeatUntilNode(ExpressionNode):

    _immutable_fields_ = ['_rcvr_expr?', '_body_expr?',
                          '_universe']
    _child_nodes_      = ['_rcvr_expr', '_body_expr']

    def __init__(self, rcvr_expr, body_expr, universe, source_section):
        ExpressionNode.__init__(self, source_section)
        self._rcvr_expr      = self.adopt_child(rcvr_expr)
        self._body_expr      = self.adopt_child(body_expr)
        self._universe       = universe

    def execute(self, frame):
        rcvr_value = self._rcvr_expr.execute(frame)
        body_block = self._body_expr.execute(frame)

        self._do_loop(rcvr_value, body_block)
        return self._universe.nilObject

    def execute_evaluated(self, frame, rcvr, args):
        self._do_loop(rcvr, args[0])
        return self._universe.nilObject

    def _do_loop(self, rcvr_block, body_block):
        body_method      = rcvr_block.get_method()
        condition_method = body_block.get_method()

        if rcvr_block.is_same_context(body_block):
            rcvr_block = body_block

        while True:
            repeat_driver.jit_merge_point(body_method     = body_method,
                                          condition_method= condition_method,
                                          node            = self)

            # STEFAN: looks stupid but might help the jit
            if rcvr_block is body_block:
                rcvr_block = body_block

            body_method.invoke(body_block, [])
            condition_value = condition_method.invoke(rcvr_block, [])
            if condition_value is self._universe.trueObject:
                break

    @staticmethod
    def can_specialize(selector, rcvr, args, node):
        sel = selector.get_string()
        return isinstance(rcvr, Block) and \
               isinstance(args[0], Block) and \
               (sel == "repeatUntil:")

    @staticmethod
    def specialize_node(selector, rcvr, args, node):
        return node.replace(
            RepeatUntilNode(node._rcvr_expr, node._arg_exprs[0],
                            node._universe, node._source_section))

Exploring the RPython Tooling

PyGame Trace Viewer

  • Open tests/jit.py
  • Add
    
        def test_unequals(self):
            cp = (py.path.local(__file__).dirpath().dirpath().join(
                  "Smalltalk").strpath + ":" +
                  py.path.local(__file__).dirpath().dirpath().join(
                  "Examples/Benchmarks").strpath)
            self._eval_expr("""PointerInequality new benchmark""", cp)
    
  • execute ../pypy/pytest.py tests/jit.py -ktest_unequals -s
  • Ignore the first trace (hit ESC)
  • Second trace has a back-pointer, indicating the loop
  • Only increments and guards on counters are in the loop
  • Disabling the #~= optimization will result in more or less identical trace, with slightly different debug info, indicating the different SOM message sends
  • Note: traces are generated with different thresholds, usually resulting in more traces than in compiled version

Exploring the RPython Tooling

Traces from Compiled Version

  • Start compilation: ./translate.sh
  • Take a break (it takes 7min on my machine)
     
  • To get production traces: PYPYLOG=jit-log-opt,jit-backend:unequals.log ./RTruffleSOM-jit -cp Smalltalk Examples/Benchmarks/BenchmarkHarness.som PointerInequality 50 0 100
  • To get a visualization and the native code: ../pypy/rpython/jit/backend/tool/viewcode.py unequals.log
  • More tools in ../pypy/rpython/jit/tool/ but currently seemingly broken
  • To inspect traces also try jitviewer