Appendix: Performance Evaluation of RPySOM and TruffleSOM

The main question for our paper "Are we there yet? Simple Language-Implementation Techniques for the 21st Century" was whether simple AST (TruffleSOM) and bytecode (RPySOM) interpreter implementations based on Truffle and RPython could reach performance in the same order of magnitude as, for instance, Java on top of a highly optimizing JVM.

In order to answer this question, we chose DeltaBlue and Richards as object-oriented benchmarks and Mandelbrot as a numerical one. All three are classic benchmarks that have been used to tune, for instance, JVMs, JavaScript VMs, PyPy, and Smalltalk VMs. Still, they are rather specific benchmarks that can give an indication of a VM's expected performance, but they are not predictors for the performance of concrete applications.

Methodology

The benchmarking methodology is kept as simple as possible. For each benchmark, we roughly determined when the VM reaches a stable state and then executed the benchmark 100 times to account for non-deterministic influences such as caches and garbage collection. Each benchmark executes with a predefined problem size on each VM, and the results are compared directly with each other.
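To make this measurement scheme concrete, the following is a minimal Python sketch of such an in-process harness. It is illustrative only: the benchmark callable, the warmup count, and the millisecond reporting are placeholders, not the actual harness used for the experiments.

    import statistics
    import time

    def harness(benchmark, warmup=10, iterations=100):
        """Warm up until the VM is (roughly) in a stable state, then time
        100 in-process runs; report mean and standard deviation in ms."""
        for _ in range(warmup):
            benchmark()  # untimed: lets JIT compilation and caches settle
        times = []
        for _ in range(iterations):
            start = time.perf_counter()
            benchmark()
            times.append((time.perf_counter() - start) * 1000.0)
        return statistics.mean(times), statistics.stdev(times)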

The benchmark machine has two quad-core Intel Xeon E5520 processors at 2.26 GHz with 8 GB of memory and runs Ubuntu Linux with kernel 3.11. For a few more details on the machine, see the specification. The experimental setup and how to recreate it are briefly described in the general README.

Overall Performance

RPySOM and TruffleSOM more or less reach the goal, i.e., performance in the same order of magnitude as the Java Virtual Machine. RPySOM is a factor of 1.7 to 10.6 slower than Java, while TruffleSOM, at a factor of 1.4 to 16, has more remaining optimization potential, yet is faster than RPySOM on two of the three benchmarks. This performance is reached without requiring custom VMs and hundreds of person-years of engineering. Thus, we conclude that RPython as well as Truffle live up to the expectations.

[Plot: overall runtime of RPySOM and TruffleSOM, normalized to Java]

A look at the details of the three benchmarks shows that performance varies widely between them, which indicates further optimization potential for specific code characteristics.

[Plot: runtime details for DeltaBlue, Mandelbrot, and Richards]

As a further reference point, we also measured the SOM++ interpreter, a SOM implementation in C++. It uses bytecodes and applies optimizations such as inline caching, threaded interpretation, and a generational garbage collector. Nonetheless, it is 73 to 710 times slower than Java. Since it is significantly slower, we did not include it in the charts, where it would distort the scale; it is, however, included in the table below.
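As an aside, the first of these optimizations is easy to sketch. An inline cache memoizes, at each send site, the method resolved for the last observed receiver class, so repeated sends to same-class receivers skip the full lookup. The following Python sketch of a monomorphic inline cache is purely illustrative and is not SOM++'s actual C++ implementation:

    class SendSite:
        """Monomorphic inline cache: remembers the receiver class and the
        method resolved for it at this particular send site."""
        def __init__(self, selector):
            self.selector = selector
            self.cached_class = None
            self.cached_method = None

        def send(self, receiver, *args):
            cls = type(receiver)
            if cls is not self.cached_class:      # cache miss: full lookup
                self.cached_class = cls
                self.cached_method = getattr(cls, self.selector)
            return self.cached_method(receiver, *args)  # cache hit path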

Runtime in ms (mean ± standard deviation)

Benchmark     Java             PyPy            RPySOM           TruffleSOM         SOM++
DeltaBlue     118.49 ± 29.01   605.73 ± 9.89    200.67 ± 0.45    170.73 ± 45.20     8655.70 ±  12.80
Mandelbrot    146.44 ±  0.40   217.20 ± 0.78    786.20 ± 1.07    343.53 ±  2.97   104033.22 ± 184.23
Richards      159.74 ± 10.86   647.71 ± 2.63   1688.53 ± 2.92   2556.76 ± 72.72    87435.83 ± 170.23
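The slowdown factors quoted above follow directly from these mean runtimes; as a quick check, in plain Python with the data copied from the table:

    # Mean runtimes in ms, from the table above.
    java    = {"DeltaBlue": 118.49, "Mandelbrot": 146.44, "Richards": 159.74}
    rpysom  = {"DeltaBlue": 200.67, "Mandelbrot": 786.20, "Richards": 1688.53}
    truffle = {"DeltaBlue": 170.73, "Mandelbrot": 343.53, "Richards": 2556.76}

    for name in java:
        print(f"{name:<11} RPySOM {rpysom[name] / java[name]:5.1f}x"
              f"   TruffleSOM {truffle[name] / java[name]:5.1f}x")
    # DeltaBlue   RPySOM   1.7x   TruffleSOM   1.4x
    # Mandelbrot  RPySOM   5.4x   TruffleSOM   2.3x
    # Richards    RPySOM  10.6x   TruffleSOM  16.0x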

Microbenchmarks

In addition to DeltaBlue, Richards, and Mandelbrot, which can be considered macro or kernel benchmarks, we also use a number of more focused microbenchmarks to assess the performance of the SOM implementations. Since these are specific to SOM, we do not have numbers for the other languages.

[Plot: microbenchmark runtimes of RPySOM and TruffleSOM]

Runtime in ms (mean ± standard deviation)

Benchmark     RPySOM            TruffleSOM
Bounce           51.65 ± 0.21    44.50 ±  9.10
BubbleSort       56.27 ± 0.25    40.27 ±  1.75
Dispatch         32.59 ± 0.19    35.36 ±  2.59
Fannkuch         65.03 ± 0.37    16.06 ±  2.40
Fibonacci       328.13 ± 2.18    95.59 ±  3.00
FieldLoop       190.06 ± 1.15   102.12 ±  3.09
IntegerLoop     137.92 ± 0.57    31.58 ±  1.34
List            154.76 ± 6.13    82.53 ±  6.10
Loop            110.44 ± 0.42    30.37 ±  2.60
Permute         430.10 ± 0.75    83.95 ±  1.89
Queens          110.88 ± 0.34    74.75 ± 29.64
QuickSort       267.03 ± 0.47    61.82 ±  2.04
Recurse         135.94 ± 0.33    69.71 ±  1.89
Sieve           251.01 ± 0.48    48.56 ±  2.76
Storage         368.92 ± 1.04   133.48 ±  2.74
Sum              90.27 ± 0.33    29.92 ±  1.49
Towers          239.46 ± 0.61    78.72 ±  5.74
TreeSort      14512.53 ± 9.05    47.55 ±  2.12
WhileLoop       265.95 ± 0.53   123.18 ±  2.95