After the paper was completed, I went on to make big plans, falling into the old trap of simply assuming that I roughly knew where the interpreters spend their time, and based on these assumptions, I developed grand ideas to make them faster. None of these ideas would be easy, of course, possibly requiring months of work. Talking to my colleagues working on the Graal compiler, I was very kindly reminded that I should know, and not guess, where the execution time goes. I remember hearing that before, probably when I told my students the same thing…
So, here we are. This blog post has two goals: to document, for future me, how to build the various Truffle interpreters and how to use VTune, and to discuss a bit of what I found about where Truffle bytecode interpreters spend much of their time.
To avoid focusing on implementation-specific aspects, I’ll look at four different Truffle-based bytecode interpreters: my own trusty TruffleSOM, Espresso (a JVM on Truffle), GraalPy, and GraalWasm.
To keep things brief, below I’ll just give the currently working magic incantations that produce ahead-of-time-compiled binaries for the bytecode interpreters, as well as how to run them as pure interpreters using the classic Mandelbrot benchmark.
Let’s start with TruffleSOM. The following command line builds all dependencies and the bytecode interpreter. It also compiles it so that just-in-time compilation is disabled, something the other implementations control via a command-line flag instead. The result is a binary in the same folder.
TruffleSOM$ mx build-native --build-native-image-tool --build-trufflesom --no-jit --type BC
Espresso is part of the Graal repository and all necessary build settings are conveniently maintained as part of the repository. To find the folder with the binary, we can use the second command:
espresso$ mx --env native-ce build
espresso$ mx --env native-ce graalvm-home
GraalPy is in its own repository, though it can be built similarly. It also conveniently prints the path where we find the result.
graalpy$ mx python-svm
Last but not least, GraalWasm takes a little more convincing to get the same result. Here, the configuration isn’t included in the repository.
wasm$ export DYNAMIC_IMPORTS=/substratevm,/sdk,/truffle,/compiler,/wasm,/tools
wasm$ export COMPONENTS=cmp,cov,dap,gvm,gwa,ins,insight,insightheap,lg,lsp,pro,sdk,sdkl,tfl,tfla,tflc,tflm
wasm$ export NATIVE_IMAGES=lib:jvmcicompiler,lib:wasmvm
wasm$ export NON_REBUILDABLE_IMAGES=lib:jvmcicompiler
wasm$ export DISABLE_INSTALLABLES=False
wasm$ mx build
wasm$ mx graalvm-home
At this point, we have our four interpreters ready for use.
As a benchmark, I’ll use the Are We Fast Yet version of Mandelbrot. This is mostly for my own convenience. For this experiment, we just need a benchmark that mostly runs inside the bytecode loop, and Mandelbrot will do a good-enough job with that.
TruffleSOM and Python take care of compilation implicitly, but for Java and Wasm, we need to produce the jar and wasm files ourselves. For Wasm, I used the C++ version of the benchmark and Emscripten to compile it.
Java$ ant jar # creates a benchmarks.jar
C++$ CXX=em++ OPT='-O3 -sSTANDALONE_WASM' build.sh
The -sSTANDALONE_WASM
flag makes sure Emscripten gives us a wasm module
that works without further issues on GraalWasm.
Executing the Mandelbrot benchmark is now relatively straightforward. Though, I’ll skip over the full path details below. For Espresso, GraalPy, and GraalWasm, we use the command-line flags to disable just-in-time compilation as follows.
Note the executables are the ones we built above. Running for instance
java --version
should show something like the following:
openjdk 21.0.2 2024-01-16
OpenJDK Runtime Environment GraalVM CE 21.0.2-dev+13.1 (build 21.0.2+13-jvmci-23.1-b33)
Espresso 64-Bit VM GraalVM CE 21.0.2-dev+13.1 (build 21-espresso-24.1.0-dev, mixed mode)
With this, running the benchmarks uses roughly the following commands:
som-native-interp-bc -cp Smalltalk Harness.som Mandelbrot 10 500
java --experimental-options --engine.Compilation=false -cp benchmarks.jar Harness Mandelbrot 10 500
graalpy --experimental-options --engine.Compilation=false ./harness.py Mandelbrot 10 500
wasm --experimental-options --engine.Compilation=false ./harness-em++-O3-sSTANDALONE_WASM Mandelbrot 10 500
There are various useful profilers out there, though my colleagues specifically asked me to have a look at VTune, and I figured it might be a convenient way to grab various hardware details from an Intel CPU.
However, I do not have direct access to an Intel workstation. So, instead of using the VTune desktop user interface or command line, I’ll actually use the VTune server on one of our benchmarking machines. This was surprisingly convenient and seems useful for rerunning previous experiments with different settings or binaries.
The machine is suitably protected, but I can’t recommend using the following in an open environment:
vtune-server --web-port $PORT --enable-server-profiling --reset-passphrase
This prints the URL where the web interface can be accessed, and is configured so that we can run experiments directly from the interface, which helps with finding the various interesting options.
For all four interpreters, I’ll focus on what VTune calls the Hotspots profiling runs. I used the Hardware Event-Based Sampling setting with additional performance insights.
After it finishes running the benchmark, VTune opens a Summary of the results with various statistics. Though, for this investigation, the most interesting part is the Bottom-up view of where the program spent its time. For all four interpreters, the top function is the bytecode loop.
Opening the top function allows us to view the assembly, and group by Basic Block / Address. This neatly adds up the time of each instruction in a basic block, and gives us an impression of how much time we spent in each block.
VTune gives us a convenient way to identify which parts of the compiled code are executed and how much time we spent in it. What surprised me is that about 50% of all time is spent in bytecode dispatch. Not the bytecode operation itself, no, but the code executed for every single bytecode leading up to and including the dispatch.
Below is the full code of the “bytecode dispatch” for GraalWasm.
As far as I can see, all four interpreters have roughly the same native code
structure. It starts with a very long sequence of instructions that likely
read various bits out of Truffle’s VirtualFrame
objects, and then proceeds
to do the actual dispatch via what the Graal compiler calls an IntegerSwitchNode
in its intermediate representation,
for which the RangeTableSwitchOp
strategy is used for compilation.
This encodes the bytecode dispatch by looking up the
jump target in a table, and then performing the jmp %rdx
instruction at the
very end of the code below.
Address | Assembly | CPU Time
---|---|---
Block 18 | | 11.963s
0x1f5b229 | xorpd %xmm0, %xmm0 | 0.306s
0x1f5b22d | mov $0xfffffffffffff, %rdx | 0.486s
0x1f5b237 | movq %rdx, 0x1f0(%rsp) | 0.080s
0x1f5b23f | mov $0xeae450, %rdx | 0.070s
0x1f5b249 | mov %r8d, %ebx | 0.346s
0x1f5b24c | movzxb 0x10(%r10,%rbx,1), %ebx | 0.391s
0x1f5b252 | movq 0x18(%rdi), %rbp | 0.050s
0x1f5b256 | movq %rbp, 0xd0(%rsp) | 0.030s
0x1f5b25e | movq 0x10(%rdi), %rbp | 0.256s
0x1f5b262 | movq %rbp, 0xc8(%rsp) | 0.441s
0x1f5b26a | lea (%r14,%rbp,1), %rsi | 0.256s
0x1f5b26e | movq %rsi, 0xc0(%rsp) | 0.020s
0x1f5b276 | lea 0xe(%r8), %esi | 0.611s
0x1f5b27a | lea 0xd(%r8), %ebp | 0.356s
0x1f5b27e | lea 0xc(%r8), %edi | 0.135s
0x1f5b282 | lea 0xb(%r8), %r11d | 0.055s
0x1f5b286 | lea 0xa(%r8), %r13d | 0.306s
0x1f5b28a | movl %r13d, 0x1ec(%rsp) | 0.401s
0x1f5b292 | lea 0x9(%r8), %r12d | 0.125s
0x1f5b296 | movl %r12d, 0x1e8(%rsp) | 0.065s
0x1f5b29e | lea 0x8(%r8), %edx | 0.326s
0x1f5b2a2 | lea 0x7(%r8), %ecx | 0.416s
0x1f5b2a6 | movl %esi, 0x1e4(%rsp) | 0.115s
0x1f5b2ad | lea 0x6(%r8), %esi | 0.030s
0x1f5b2b1 | movl %ebp, 0x1e0(%rsp) | 0.311s
0x1f5b2b8 | lea 0x5(%r8), %ebp | 0.356s
0x1f5b2bc | movl %edi, 0x1dc(%rsp) | 0.150s
0x1f5b2c3 | lea 0x4(%r8), %edi | 0.040s
0x1f5b2c7 | movl %r11d, 0x1d8(%rsp) | 0.321s
0x1f5b2cf | lea 0x3(%r8), %r11d | 0.416s
0x1f5b2d3 | movl %r11d, 0x1d4(%rsp) | 0.135s
0x1f5b2db | lea 0x2(%r8), %r13d | 0.065s
0x1f5b2df | movl %r13d, 0x1d0(%rsp) | 0.276s
0x1f5b2e7 | mov %r8d, %r12d | 0.426s
0x1f5b2ea | inc %r12d | 0.125s
0x1f5b2ed | movl %r12d, 0x1cc(%rsp) | 0.045s
0x1f5b2f5 | movl %edx, 0x1c8(%rsp) | 0.321s
0x1f5b2fc | mov %r9d, %edx | 0.441s
0x1f5b2ff | inc %edx | 0.120s
0x1f5b301 | movl %edx, 0x1c4(%rsp) | 0.040s
0x1f5b308 | lea -0x3(%r9), %edx | 0.366s
0x1f5b30c | movl %edx, 0x1c0(%rsp) | 0.311s
0x1f5b313 | lea -0x2(%r9), %edx | 0.221s
0x1f5b317 | movl %edx, 0x1bc(%rsp) | 0.045s
0x1f5b31e | lea 0xf(%r8), %edx | 0.376s
0x1f5b322 | mov %r9d, %r8d | 0.366s
0x1f5b325 | dec %r8d | 0.085s
0x1f5b328 | movl %r8d, 0x1b8(%rsp) | 0.045s
0x1f5b330 | movl %edx, 0x1b4(%rsp) | 0.371s
0x1f5b337 | mov %ebx, %r9d | 0.481s
0x1f5b33a | cmp $0xfe, %r9d | 0.035s
0x1f5b341 | jnbe 0x1f7c658 |
Block 19 | | 0.982s
0x1f5b347 | lea 0xa(%rip), %rdx | 0.035s
0x1f5b34e | movsxdl (%rdx,%r9,4), %r9 | 0.321s
0x1f5b352 | add %r9, %rdx | 0.506s
0x1f5b355 | jmp %rdx | 0.120s
Someone who writes bytecode interpreters directly in assembly might be mortified by this code. Though, to me this is more of an artifact of some missed optimization opportunities in the otherwise excellent Graal compiler, which hopefully can be fixed.
I won’t include the results for the other interpreters here, but to summarize, let’s count the instructions of the bytecode dispatch for each of them:
For GraalPy, it is even a little more complex. There are various other basic blocks involved, and none of the bytecode handlers jumps back directly to the same block. Instead, there seems to be some more code after each handler before they jump back to the top of the loop.
As mentioned for TruffleSOM, I did already look into one optimization opportunity. The very careful reader might have noticed the end of block 18 above.
Address | Assembly | CPU Time
---|---|---
0x1f5b33a | cmp $0xfe, %r9d | 0.035s
0x1f5b341 | jnbe 0x1f7c658 |
This is a correctness check for the switch/case statement in the Java code.
It makes sure that the value we switch over is covered by the cases in the
switch. Otherwise, we’re jumping to some default block, or just back to the
top of the loop.
The implementation in Graal is a little bit too hard-coded for my taste. While it has all the logic to eliminate unreachable cases, for instance when it sees that the value we switch over is guaranteed to exclude some cases, it handles the default case directly in the machine-specific lowering code.
It seems like one could generalize this a little more and possibly handle the default case like the other cases. The relevant Graal issue for this micro-optimization is #8425. However, when I applied this to TruffleSOM’s version of Graal and eliminated those two instructions, it didn’t make a difference. The remaining 31 instructions still dominate the bytecode dispatch.
The most important takeaway message here is of course: know, don’t assume; or more specifically: measure, don’t guess.
For the problem at hand, it looks like Graal struggles with hoisting some common reads out of the bytecode loops. If there’s a way to fix this, this could give a massive speedup to all Truffle-based bytecode interpreters, perhaps enough to invalidate our AST vs. Bytecode Interpreters paper. Wouldn’t that be fun?! 🤓 🧑🏻🔬
The mentioned micro-optimization would also avoid a few instructions for every switch/case in normal Java code that doesn’t need the default case. So, it might be relevant for more than just bytecode interpreters.
For questions, comments, or suggestions, please find me on Twitter @smarr or Mastodon.
After turning my notes into this blog post yesterday, I figured today I should also look at what RPython is doing for my PySOM bytecode interpreter.
The 13 instructions below are the bytecode dispatch for the interpreter. While it is much shorter, it also contains the safety check cmp $0x45, %al to make sure the bytecode is within the set of jump targets. Ideally, a bytecode verifier would have ensured that already, or perhaps we could set up the native code so that there are simply no unsafe jump targets, to avoid having to check every time, which at least based on VTune seems to consume a considerable amount of the overall run time.
Also somewhat concerning is that the check is done twice. Block 21 already has a cmp $0x45, %rax, which should make the second test unnecessary. (Correction: the second check was unrelated, and I managed to remove it by applying an optimization I already had in TruffleSOM, but not yet in PySOM.)
So, yeah, I guess PySOM could be a bit faster on every single bytecode, which might mean PyPy could possibly also improve its interpreted performance.
Address | Assembly | CPU Time
---|---|---
Block 20 | | 0.085s
0xb45d8 | cmp %r14, %rcx | 0.085s
0xb45db | jle 0xb6ba0 |
Block 21 | | 0.631s
0xb45e1 | movzxb 0x10(%rdx,%r14,1), %eax | 0.311s
0xb45e7 | cmp $0x45, %rax | 0.321s
0xb45eb | jnle 0xb6c00 |
Block 22 | | 4.601s
0xb45f1 | lea 0x69868(%rip), %rdi | 0.571s
0xb45f8 | movq 0x10(%rdi,%rax,8), %r15 | 0.020s
0xb45fd | add %r14, %r15 | 2.942s
0xb4600 | cmp $0x45, %al | 1.068s
0xb4602 | jnbe 0xb6c59 |
Block 23 | | 0.476s
0xb4608 | movsxdl (%rbx,%rax,4), %rax | 0s
0xb460c | add %rbx, %rax | 0.015s
0xb460f | jmp %rax | 0.461s
The above plot is based on the performance of 9 microbenchmarks and 5 slightly larger benchmarks, which were designed to study the effectiveness of compilers.
My goal with this poll is to see what the general performance expectations for various language implementations are.
For questions, comments, or suggestions, please find me on Twitter @smarr or Mastodon.
Thanks for playing!
My goal is to understand which concrete “guarantees” the GIL gives in both versions, which “guarantees” it does not give, and which ones one might assume based on testing and observation. I am putting “guarantees” in quotes, because with a future no-GIL Python, none of the discussed properties should be considered language guarantees.
While Python has various implementations, including CPython, PyPy, Jython, IronPython, and GraalPy, I’ll focus on CPython as the most widely used implementation. PyPy and GraalPy also use a GIL, but their implementations subtly differ from CPython’s, as we will see a little later.
Let’s recap a bit of background. When CPython started to support multiple operating system threads, it became necessary to protect various CPython-internal data structures from concurrent access. Instead of adding locks or using atomic operations to protect the correctness of, for instance, reference counting, the contents of lists, dictionaries, or internal data structures, the CPython developers decided to take a simpler approach and use a single global lock, the GIL, to protect all of these data structures from incorrect concurrent accesses. As a result, one can start multiple threads in CPython, though only a single one of them runs Python bytecode at any given time.
The main benefit of this approach is its simplicity and single-threaded performance. Because there’s only a single lock to worry about, it’s easy to get the implementation correct without risking deadlocks or other subtle concurrency bugs at the level of the CPython interpreter. Thus, the GIL represented a suitable point in the engineering trade-off space between correctness and performance.
Of course, the obvious downside of this design is that only a single thread can execute Python bytecode at any given time. I am talking about Python bytecode here again, because operations that may take a long time, for instance reading a file into memory, can release the GIL and allow other threads to run in parallel.
For programs that spend most of their time executing Python code, the GIL is of course a huge performance bottleneck, and thus, PEP 703 proposes to make the GIL optional. The PEP mentions various use cases, including machine learning, data science, and other numerical applications.
So far, I only mentioned that the GIL is there to protect CPython’s internal data structures from concurrent accesses to ensure correctness. However, when writing Python code, I am more interested in the “correctness guarantees” the GIL gives me for the concurrent code that I write. To know these “correctness guarantees”, we need to delve into the implementation details of when the GIL is acquired and released.
The general approach is that a Python thread obtains the GIL when it starts executing Python bytecode. It will hold the GIL as long as it needs to and eventually release it, for instance when it is done executing, or when it is executing some operation that often would be long-running and itself does not require the GIL for correctness. This includes for instance the aforementioned file reading operation or more generally any I/O operation. However, a thread may also release the GIL when executing specific bytecodes.
This is where Python 3.9 and 3.13 differ substantially. Let’s start with Python 3.13, which I think roughly corresponds to what Python has been doing since version 3.10 (roughly since this PR). Here, the most relevant bytecodes are for function or method calls as well as bytecodes that jump back to the top of a loop or function. Thus, only a few bytecodes check whether there was a request to release the GIL.
In contrast, in Python 3.9 and earlier versions,
the GIL is released at least in some situations by almost all bytecodes.
Only a small set of bytecodes, including stack operations, LOAD_FAST, LOAD_CONST, STORE_FAST, UNARY_POSITIVE, IS_OP, CONTAINS_OP, and JUMP_FORWARD, do not check whether the GIL should be released.
The bytecodes that do check use CHECK_EVAL_BREAKER() on 3.13 (src) or DISPATCH() on 3.9 (src), which eventually checks (3.13, 3.9) whether another thread requested the GIL to be released by setting the GIL_DROP_REQUEST bit in the interpreter’s state.
What makes “atomicity guarantees” more complicated to reason about is that this bit is set by threads waiting for the GIL
based on a timeout (src).
The timeout is specified by sys.setswitchinterval().
In practice, what does this mean?
For Python 3.13, this should mean that a function that contains only bytecodes
that do not lead to a CHECK_EVAL_BREAKER()
check should be atomic.
For Python 3.9, this means a very small set of bytecode sequences can be atomic, though, except for a tiny set of specific cases, one can assume that a bytecode sequence is not atomic.
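To see which bytecode sequence such reasoning applies to, we can inspect a function with the standard dis module. Below is a minimal sketch; the Holder class and increment_field function are just for illustration:

import dis

class Holder:
    def __init__(self):
        self.id = 0

def increment_field(holder):
    # On 3.9, almost every instruction shown by dis is a point where
    # the GIL may be released; on 3.13, only calls and jumps back to
    # the top of a loop or function check for a GIL_DROP_REQUEST.
    holder.id += 1

dis.dis(increment_field)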
However, since the Python community is taking steps that may lead to the removal of the GIL, the changes in recent Python versions that give much stronger atomicity “guarantees” are likely a step in the wrong direction for the correctness of concurrent Python code. I mean this in the sense that people may accidentally rely on these implementation details, leading to hard-to-find concurrency bugs when running on a no-GIL Python.
Thanks to @cfbolz, I have at least one very concrete example of code that someone assumed to be atomic:
request_id = self._next_id
self._next_id += 1
The code tries to solve a classic problem: we want to hand out unique request ids. However, it breaks when multiple threads execute this code at the same time, or rather, interleaved with each other, because then we end up getting the same id multiple times. This concrete bug was fixed by making the reading and incrementing atomic using a lock.
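A minimal sketch of such a fix might look as follows; the IdGenerator class and its names are illustrative, not the code of the actual fix:

import threading

class IdGenerator:
    def __init__(self):
        self._next_id = 0
        self._lock = threading.Lock()

    def get_id(self):
        # Holding the lock makes the read-increment pair atomic,
        # independent of which bytecodes may release the GIL.
        with self._lock:
            request_id = self._next_id
            self._next_id += 1
        return request_id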
On Python 3.9, we can relatively easily demonstrate the issue:
def get_id(self):
# expected to be atomic: start
request_id = self.id
self.id += 1
# expected to be atomic: end
self.usage[request_id % 1_000] += 1
Running this on multiple threads will allow us to observe an inconsistent number of usage counts. They should all be the same, but they are not.
Arguably, it’s not clear whether the observed atomicity issue is from the request_id
or the usage
counts,
but the underlying issue is the same in both cases.
For the full example see 999_example_bug.py.
This repository contains a number of other examples that demonstrate the difference between different Python implementations and versions.
Generally, on Python 3.13, most bytecode sequences without function calls will be atomic. On Python 3.9, many fewer are, and I believe it would be better to prevent people from writing code that relies on the very strong guarantees that Python 3.13 gives.
As mentioned earlier, because the GIL is released based on a timeout, one may also perceive bytecode sequences as atomic when experimenting.
Let’s assume we run the following two functions on threads in parallel:
def insert_fn(list):
for i in range(100_000_000):
list.append(1)
list.append(1)
list.pop()
list.pop()
return True
def size_fn(list):
for i in range(100_000_000):
l = len(list)
assert l % 2 == 0, f"List length was {l} at attempt {i}"
return True
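To actually run the two functions on parallel threads, a driver along the following lines should work. This is just a sketch; the harness in the gil-sem-demos repository mentioned below differs in its details:

import threading

shared_list = []
t1 = threading.Thread(target=insert_fn, args=(shared_list,))
t2 = threading.Thread(target=size_fn, args=(shared_list,))
t1.start()
t2.start()
t1.join()
t2.join()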
Depending on how fast the machine is, it may take 10,000 or more iterations of the loop in size_fn before we see an odd list length. This means it takes 10,000 iterations before the function calls to append or pop allow the GIL to be released before the second append(1) or after the first pop().
Without looking at the CPython source code, one might have concluded easily that these bytecode sequences are atomic.
Though, there’s a way to make it visible earlier.
By setting the thread switch interval to a very small value, for instance with sys.setswitchinterval(0.000000000001), one can observe an odd list length after only a few, or a few hundred, iterations of the loop.
In my gil-sem-demos repository, I have a number of examples that try to demonstrate observable differences in GIL behavior.
Of course, the very first example tries to show the performance benefit of running multiple Python threads in parallel. Using the no-GIL implementation, one indeed sees the expected parallel speedup.
On the other tests, we see the major differences between Python 3.8 - 3.9 and the later 3.10 - 3.13 versions. The latter versions usually execute the examples without seeing results that show a bytecode-level atomicity granularity. Instead, they suggest that loop bodies without function calls are pretty much atomic.
For PyPy and GraalPy, it is also harder to observe the bytecode-level atomicity granularity, because they are simply faster. Lowering the switch interval makes it a little more observable, except for GraalPy, which likely aggressively removes the checks for whether to release the GIL.
Another detail for the no-GIL implementation: it crashes for our earlier bug example.
It complains about *** stack smashing detected ***.
A full log is available as a gist.
In this blog post, I looked into the implementation details of CPython’s Global Interpreter Lock (GIL). The semantics between Python 3.9 and 3.13 differ substantially. Python 3.13 gives much stronger atomicity “guarantees”, releasing the GIL basically only on function calls and jumps back to the top of a loop or function.
If the Python community intends to remove the GIL, this seems problematic. I would expect more people to implicitly rely on these much stronger guarantees, whether consciously or not.
My guess would be that this change was done mostly in an effort to improve the single-threaded performance of CPython.
To enable people to test their code on versions with semantics closer to a no-GIL implementation, I would suggest adding a compile-time option to CPython that forces a GIL release and thread switch after bytecodes that may trigger behavior visible to other threads. This way, people would have a chance to test on a stable system that is closer to the future no-GIL semantics and probably only minimally slower at executing unit tests.
Any comments, suggestions, or questions are also greatly appreciated perhaps on Twitter @smarr or Mastodon.
Writing this post, I realized that I have been working on interpreters on top of metacompilation systems for 10 years now. And in much of that time, it felt to me that widely held beliefs about interpreters did not quite match my own experience.
In summer 2013, I started to explore new directions for my research after finishing my PhD just in January of that year. The one question that was still bugging me back then was how I could show that my PhD ideas of using metaobject protocols to realize different concurrency models were not just possible, but perhaps even practical.
During my PhD, I worked with a bytecode interpreter written in C++ and now had the choice to spend the next few years writing a just-in-time compiler for this thing, which is never going to be able to keep up with what state-of-the-art VMs offered, or finding another way to get good performance.
To my surprise, there was another way. Though even today, some say that metacompilation is ludicrous and will never be practical. And others simply use PyPy as their secret sauce and enjoy better Python performance…
Though, I didn’t start with PyPy, or rather its RPython metacompilation framework. Instead, I found the Truffle framework with its Graal just-in-time compiler and implemented the Simple Object Machine (SOM) on top of it, a dynamic language I had been using for a while already. Back then, Truffle and Graal were very new, and it took a while to get my TruffleSOM to reach the desired “state-of-the-art performance”, but I got there eventually. On the way to reaching that point, I also implemented PySOM on top of PyPy’s RPython framework. This was my way to get a better understanding of metacompilation more generally. But enough of history…
If you’re Google, you can afford to finance your own JavaScript virtual machine with a state-of-the-art interpreter, state-of-the-art garbage collector, and no less than three different just-in-time compilers, which of course are also “state of the art”. Your average academic, and even large language communities such as those around Ruby, Python, and PHP, do not have the resources to build such VMs.¹

¹ Of course, reality is more complicated, but I’ll skip over it.
This is where metacompilation comes in and promises us that we can reuse existing compilers, garbage collectors, and all the other parts of an existing high-level language VM. All we need to do is implement our language as an interpreter on top of something like the GraalVM or RPython systems.
That’s the promise. And, for a certain set of benchmarks, and a certain set of use cases, these systems deliver exactly that: reuse of these components. We still have to implement an interpreter suitable for these systems though. While this is no small feat, it’s something “an average” academic can do with enough time and stubbornness, and there are plenty of examples using Truffle and RPython.
And my SOM implementation indeed manages to hold its own compared to Google’s V8:
As we can see here, both V8 inside of Node.js and TSOMAST, which is short for the abstract-syntax-tree-based TruffleSOM interpreter, are roughly in the same range of being 1.7× to 2.7× slower than the HotSpot JVM. With SOM being a dynamic language similar to JavaScript, and easily within ±50% of V8’s performance, I’d argue that metacompilation indeed lives up to its promise.
Of course, the used Are We Fast Yet benchmarks test only a relatively small common part of these languages, but they show that the core language elements common to Java, JavaScript, and other object-oriented languages reach roughly the same level of performance.
However, we wanted to talk about abstract-syntax-tree (AST) and bytecode interpreters. The goal of our work was to investigate the difference between these two approaches to build interpreters on top of GraalVM and RPython.
To this end, we had to implement no less than four interpreters: two AST interpreters and two bytecode interpreters, one of each on the two different systems. We also had to optimize them roughly to the same level, so that we can draw conclusions from the comparison. While AST and bytecode interpreters naturally lend themselves to different optimizations, we didn’t stop at what’s commonly done. Instead, we implemented the classic optimizations to gain the key performance benefits, and once we hit diminishing returns, we added the optimizations from the AST interpreters to the bytecode ones or the other way around, so that they roughly implement the same set of optimizations (see Section 4 of the paper for more details).
Some other classic interpreter optimizations were not directly possible on top of the metacompilation systems. This includes indirect threading and top-of-stack caching for bytecode interpreters. Well, more precisely, we experimented a little with both, but they didn’t give the desired benefits and rather slowed the interpreters down, and we only made them work on a few benchmarks. Pushing this further will require extensive changes to the metacompilation systems as far as we can tell from talking to people working on GraalVM and RPython. So, future work…
With all these optimizations, we reached a point where adding further optimizations showed only minimal gains, typically specific to a benchmark, which gives us some confidence that our results are meaningful.
The final results are as follows:
Since we are confident that both types of interpreters are optimized roughly to the same level, we conclude that bytecode interpreters do not have their traditional advantage on top of metacompilation systems when it comes to pure interpreter speed.
Based on our current understanding, we attribute that to some general challenges metacompilation systems face when producing native code for bytecode interpreter loops. The GraalVM, for instance, uses the Graal IR, which only supports structured control flow. This means we cannot directly encode arbitrary jumps between bytecodes, and the compiler also struggles with the size of bytecode loops, leaving some optimization opportunities on the table. The bytecode loops of a standard interpreter can be multiple tens or hundreds of kilobytes of native code, whereas bytecode interpreters written in C/C++ are typically much more concise.
We also looked at memory use, and indeed, on that metric bytecode interpreters win compared to AST interpreters. However, the difference is not as stark as one might expect, and bytecode interpreters may also require more allocations based on how they are structured for boxing or other run-time data structures.
Though, this blog post is just a high-level overview. For the details, please see the paper below.
Based on our results, what would I do, if I had to implement another language on top of Truffle or RPython? Well, as always, it really depends… Let’s look at the two ends of the spectrum I can think of:
An established, widely used language: For this, I’d assume that a bytecode has been defined and that it is going to evolve slowly in the future. I’d also assume it has possibly large existing programs with a lot of code, i.e., hundreds of thousands of lines of code. For these types of languages, I’d suggest sticking with the bytecode. Bytecode is fast to load, and lots of code is likely executed only once or not at all. This means the memory benefits of the compact representation likely outweigh other aspects, and we get decent performance.
A completely new language: Here, I assume the language will first need to find its way, and its design may change frequently. On top of metacompilation systems, we can get really good performance, there’s not a lot of code for our language yet, and our users are unlikely to care too much about loading huge amounts of code. Here, AST interpreters in the Truffle style are likely the more flexible choice. You don’t need to design a bytecode and can instead focus on the language and on getting to acceptable performance first. Later, once there are larger code bases with their own performance challenges, one may still think about designing a bytecode, but I suspect few languages will ever get there.
For languages in between these two ends of the spectrum, one would probably want to weigh up engineering effort, which I’d think to be lower for AST interpreters, and memory use for large code bases, where bytecode interpreters are better.
Our paper includes more on the background, our interpreter implementations, and our experimental design. It also has a more in-depth analysis on the performance properties we observed in terms of run time and memory use. To guide the work of language implementers, we also looked at the various optimizations, to see which ones are most important to gain interpreter as well as just-in-time compiled performance.
As artifact, our paper comes with a Docker image that includes all experiments and raw data, which hopefully enables others to reproduce our results and expand on or compare to our work. The Dockerfile itself is also on GitHub, where one can also find the latest versions of TruffleSOM and PySOM. Though these two don’t yet contain the same versions as the paper, and may evolve further.
The paper is the result of a collaboration with Octave, Humphrey, and Sophie.
Any comments or suggestions are also greatly appreciated perhaps on Twitter @smarr or Mastodon.
Abstract
Thanks to partial evaluation and meta-tracing, it became practical to build language implementations that reach state-of-the-art peak performance by implementing only an interpreter. Systems such as RPython and GraalVM provide components such as a garbage collector and just-in-time compiler in a language-agnostic manner, greatly reducing implementation effort. However, meta-compilation-based language implementations still need to improve further to reach the low memory use and fast warmup behavior that custom-built systems provide. A key element in this endeavor is interpreter performance. Folklore tells us that bytecode interpreters are superior to abstract-syntax-tree (AST) interpreters both in terms of memory use and run-time performance.
This work assesses the trade-offs between AST and bytecode interpreters to verify common assumptions and whether they hold in the context of meta-compilation systems. We implemented four interpreters, each an AST and a bytecode one using RPython and GraalVM. We keep the difference between the interpreters as small as feasible to be able to evaluate interpreter performance, peak performance, warmup, memory use, and the impact of individual optimizations.
Our results show that both systems indeed reach performance close to Node.js/V8. Looking at interpreter-only performance, our AST interpreters are on par with, or even slightly faster than their bytecode counterparts. After just-in-time compilation, the results are roughly on par. This means bytecode interpreters do not have their widely assumed performance advantage. However, we can confirm that bytecodes are more compact in memory than ASTs, which becomes relevant for larger applications. However, for smaller applications, we noticed that bytecode interpreters allocate more memory because boxing avoidance is not as applicable, and because the bytecode interpreter structure requires memory, e.g., for a reified stack.
Our results show AST interpreters to be competitive on top of meta-compilation systems. Together with possible engineering benefits, they should thus not be discounted so easily in favor of bytecode interpreters.
@article{Larose:2023:AstVsBc,
  abstract = {Thanks to partial evaluation and meta-tracing, it became practical to build language implementations that reach state-of-the-art peak performance by implementing only an interpreter. Systems such as RPython and GraalVM provide components such as a garbage collector and just-in-time compiler in a language-agnostic manner, greatly reducing implementation effort. However, meta-compilation-based language implementations still need to improve further to reach the low memory use and fast warmup behavior that custom-built systems provide. A key element in this endeavor is interpreter performance. Folklore tells us that bytecode interpreters are superior to abstract-syntax-tree (AST) interpreters both in terms of memory use and run-time performance. This work assesses the trade-offs between AST and bytecode interpreters to verify common assumptions and whether they hold in the context of meta-compilation systems. We implemented four interpreters, each an AST and a bytecode one using RPython and GraalVM. We keep the difference between the interpreters as small as feasible to be able to evaluate interpreter performance, peak performance, warmup, memory use, and the impact of individual optimizations. Our results show that both systems indeed reach performance close to Node.js/V8. Looking at interpreter-only performance, our AST interpreters are on par with, or even slightly faster than their bytecode counterparts. After just-in-time compilation, the results are roughly on par. This means bytecode interpreters do not have their widely assumed performance advantage. However, we can confirm that bytecodes are more compact in memory than ASTs, which becomes relevant for larger applications. However, for smaller applications, we noticed that bytecode interpreters allocate more memory because boxing avoidance is not as applicable, and because the bytecode interpreter structure requires memory, e.g., for a reified stack. Our results show AST interpreters to be competitive on top of meta-compilation systems. Together with possible engineering benefits, they should thus not be discounted so easily in favor of bytecode interpreters.},
  appendix = {https://doi.org/10.5281/zenodo.8147414},
  author = {Larose, Octave and Kaleba, Sophie and Burchell, Humphrey and Marr, Stefan},
  blog = {https://stefan-marr.de/2023/10/ast-vs-bytecode-interpreters/},
  doi = {10.1145/3622808},
  html = {https://stefan-marr.de/papers/oopsla-larose-et-al-ast-vs-bytecode-interpreters-in-the-age-of-meta-compilation/},
  issn = {2475-1421},
  journal = {Proceedings of the ACM on Programming Languages},
  keywords = {AST Bytecode CaseStudy Comparison Interpreter JITCompilation MeMyPublication MetaTracing PartialEvaluation myown},
  month = oct,
  number = {OOPSLA2},
  pages = {1--31},
  pdf = {https://stefan-marr.de/downloads/oopsla23-larose-et-al-ast-vs-bytecode-interpreters-in-the-age-of-meta-compilation.pdf},
  publisher = {{ACM}},
  series = {OOPSLA'23},
  title = {AST vs. Bytecode: Interpreters in the Age of Meta-Compilation},
  volume = {7},
  year = {2023},
  month_numeric = {10}
}
Humphrey started his research work wanting to make profilers produce more directly actionable suggestions where and how to optimize programs. Though, relatively quickly we noticed that sampling profilers are not only probabilistic as one would expect, but can give widely different results between runs, which do not necessarily converge with many runs either.
In 2010, Mytkowicz et al. identified safepoint bias as a key issue for sampling profilers for Java programs. Though, their results were not quite as bad as what we were seeing, so Humphrey started to design experiments to characterize the precision and accuracy of Java profilers in more detail.
Before getting started: we’re fully aware that this isn’t a new issue, and there are quite a number of great and fairly technical blog posts out there discussing a large range of issues, for instance here, here, here, and here. In our work, we will only look at fully deterministic and small pure Java benchmarks to get a better understanding of what the current situation is.
What’s the issue, you may ask? Well, let’s look at an example. Figure 1 shows the profiling results of Java Flight Recorder over 30 runs on the DeltaBlue benchmark. We see 8 different methods being identified as the hottest method, indicated by the hatched bars in red.
Of course, much of this could probably be explained by the non-determinism inherent to JVMs such as HotSpot: just-in-time compilation, parallel compilation, garbage collection, etc. However, we run each benchmark not only 30 times but also long enough to be fully compiled. So, we basically give the profiler and JVM a best-case scenario.¹ And what do we get as a result? No clear indication where to start optimizing our application. However, if we had looked at only a single profile, we might have started optimizing something that is rarely the bit of code the application spends most time on.

¹ At least to the degree that is practical. Though, benchmarking is hard, and there are many things going on in modern VMs. See also Tratt’s posts on the topic: 1, 2
Fortunately, this is indeed the worst case we found.
Overall, we looked at async-profiler, Honest Profiler, Java Flight Recorder, JProfiler, perf, and YourKit. Figure 2 shows box plots for each of these profilers to indicate the range of differences we found between the minimum and maximum run-time percentage reported for each method over all benchmarks. Thus, the median isn’t too bad, but each profiler shows cases where there is more than 15% difference between some of the runs.
The paper goes into much more detail analyzing the results by comparing profilers with themselves and among each other to be able to characterize accuracy and precision without knowing a ground truth. It also includes plots that show how the results are distributed for specific methods to identify possible sources of the observed variation.
So, for all the details, please see the paper linked below. Any pointers and suggestions are also greatly appreciated perhaps on Twitter @smarr or Mastodon.
Abstract
To identify optimisation opportunities, Java developers often use sampling profilers that attribute a percentage of run time to the methods of a program. Even so these profilers use sampling, are probabilistic in nature, and may suffer for instance from safepoint bias, they are normally considered to be relatively reliable. However, unreliable or inaccurate profiles may misdirect developers in their quest to resolve performance issues by not correctly identifying the program parts that would benefit most from optimisations.
With the wider adoption of profilers such as async-profiler and Honest Profiler, which are designed to avoid the safepoint bias, we wanted to investigate how precise and accurate Java sampling profilers are today. We investigate the precision, reliability, accuracy, and overhead of async-profiler, Honest Profiler, Java Flight Recorder, JProfiler, perf, and YourKit, which are all actively maintained. We assess them on the fully deterministic Are We Fast Yet benchmarks to have a stable foundation for the probabilistic profilers.
We find that profilers are relatively reliable over 30 runs and normally report the same hottest method. Unfortunately, this is not true for all benchmarks, which suggests their reliability may be application-specific. Different profilers also report different methods as hottest and cannot reliably agree on the set of top 5 hottest methods. On the positive side, the average run time overhead is in the range of 1% to 5.4% for the different profilers.
Future work should investigate how results can become more reliable, perhaps by reducing the observer effect of profilers by using optimisation decisions of unprofiled runs or by developing a principled approach of combining multiple profiles that explore different dynamic optimisations.
@inproceedings{Burchell:2023:Profilers,
  abstract = {To identify optimisation opportunities, Java developers often use sampling profilers that attribute a percentage of run time to the methods of a program. Even so these profilers use sampling, are probabilistic in nature, and may suffer for instance from safepoint bias, they are normally considered to be relatively reliable. However, unreliable or inaccurate profiles may misdirect developers in their quest to resolve performance issues by not correctly identifying the program parts that would benefit most from optimisations. With the wider adoption of profilers such as async-profiler and Honest Profiler, which are designed to avoid the safepoint bias, we wanted to investigate how precise and accurate Java sampling profilers are today. We investigate the precision, reliability, accuracy, and overhead of async-profiler, Honest Profiler, Java Flight Recorder, JProfiler, perf, and YourKit, which are all actively maintained. We assess them on the fully deterministic Are We Fast Yet benchmarks to have a stable foundation for the probabilistic profilers. We find that profilers are relatively reliable over 30 runs and normally report the same hottest method. Unfortunately, this is not true for all benchmarks, which suggests their reliability may be application-specific. Different profilers also report different methods as hottest and cannot reliably agree on the set of top 5 hottest methods. On the positive side, the average run time overhead is in the range of 1% to 5.4% for the different profilers. Future work should investigate how results can become more reliable, perhaps by reducing the observer effect of profilers by using optimisation decisions of unprofiled runs or by developing a principled approach of combining multiple profiles that explore different dynamic optimisations.},
  acceptancerate = {0.54},
  appendix = {https://github.com/HumphreyHCB/AWFY-Profilers},
  author = {Burchell, Humphrey and Larose, Octave and Kaleba, Sophie and Marr, Stefan},
  blog = {https://stefan-marr.de/2023/09/dont-blindly-trust-your-profiler/},
  booktitle = {Proceedings of the 20th ACM SIGPLAN International Conference on Managed Programming Languages and Runtimes},
  doi = {10.1145/3617651.3622985},
  keywords = {CPUSampling Comparison MeMyPublication Precision Profiling myown},
  month = oct,
  pages = {1--14},
  pdf = {https://stefan-marr.de/downloads/mplr23-burchell-et-al-dont-trust-your-profiler.pdf},
  publisher = {ACM},
  series = {MPLR'23},
  title = {{Don’t Trust Your Profiler: An Empirical Study on the Precision and Accuracy of Java Profilers}},
  year = {2023},
  month_numeric = {10}
}
For my own lectures, I decided to take an approach that focused more on the high-level ideas and can introduce a wider audience to how we build interpreters and a range of techniques for just-in-time compilation.
Of course, I also wanted to talk a little bit about our own work. Thus, both lectures come with the strong bias of meta-compilation systems. My interpreter lecture is informed by our upcoming OOPSLA paper, which shows that in the context of meta-compilation systems, abstract-syntax-tree interpreters are doing surprisingly well compared to bytecode interpreters.
My lecture on just-in-time compilation of course also went into how meta-compilation works and how it enables us to build languages that can reach state-of-the-art performance by compiling a user program through our interpreters. While it’s still a lot of work, the big vision is that one day, we might just define the grammar, provide a few extra details of how the language is to be executed, and then some kind of toolchain gives us a language runtime that executes user programs with state-of-the-art performance.
One can still dream… 🤓
When preparing these lectures, I was also looking back at the lectures I gave in 2019 for a summer school at Dagstuhl. Perhaps, this material will at some point form its own course on Virtual Machines. Another of those dreams…
I have to admit, the original abstracts don’t quite represent the final lectures. So, I’ll also include the outlines in addition to the slides.
Implementers often start with an interpreter to sketch how a language may work. They are easy to implement and great to experiment with. However, they are also an essential part of dynamic language implementations. We will talk about the basics of abstract syntax trees, bytecodes, and how these ideas can be used to implement a language. We will also look into optimizations for interpreters: how AST and bytecode interpreters can use run-time feedback to improve performance, and discuss how super nodes and super instructions allow us to make effective use of modern CPUs.
Since the early days of object-oriented languages, run-time polymorphism has been a challenge for implementers. Smalltalk and Self pushed many ideas to an extreme, their implementers had to invent techniques such as: lookup caches, tracing and method-based compilation, deoptimization, and maps. While these ideas originated in the ’80s and ‘90s, they are key ingredients of today’s just-in-time compilers for Java, Ruby, Python, JavaScript.
If you have any questions, I am more than happy to answer, possibly on Twitter @smarr or Mastodon.
Earlier this year, Wanhong Huang, Tomoharu Ugawa, and I published some new experiments on interpreter performance. We experimented with a Genetic Algorithm to squeeze a little more performance out of bytecode interpreters. Since I spent much of my research time looking for ways to improve interpreter performance, I was quite intrigued by the basic question behind Wanhong’s experiments: which is the best order of bytecode handlers in the interpreter loop?
Let’s start with a bit of background. Many of today’s widely used interpreters use bytecodes, which represent a program as operations quite similar to processor instructions. Though, depending on the language we are trying to support in our interpreter, the bytecodes can be arbitrarily complex, in terms of how they encode arguments, but also in terms of the behavior they implement.
In the simplest case, we would end up with an interpreter loop that looks roughly like this:
uint8_t bytecodes[] = { push_local, push_local, add, /* ... */ };

while (true) {
  uint8_t bytecode = bytecodes[index];
  index += /* ... */;
  switch (bytecode) {
    case push_local:
      // ...
      break;
    case add:
      // ...
      break;
    case call_method:
      // ...
      break;
  }
}
A switch/case interpreter loop.
and add
are much simpler than any call_method
bytecode.
Depending on the language that we try to implement, push_local
is likely just a few processor instructions, while call_method
might be significantly more complex, because it may need to lookup the method, ensure that arguments are passed correctly, and ensure that we have memory for local variables for the method that is to be executed.
Since bytecodes can be arbitrarily complex, S. Brunthaler distinguished between high abstraction-level interpreters and low abstraction-level ones. High abstraction-level interpreters do not spend a lot of time on bytecode dispatch, but low abstraction-level ones do, because their bytecodes are comparably simple and often consist of just a handful of processor instructions. Thus, low abstraction-level interpreters benefit most from optimizing the bytecode dispatch.
A classic optimization of the bytecode dispatch is threaded code interpretation, in which we represent a program not only using bytecodes, but with an additional array of jump addresses. This optimization is also often called direct threaded code. It is particularly beneficial for low abstraction-level interpreters but applied more widely.
uint8_t bytecodes[] = { push_local, push_local, add, /* ... */ };
void* targets[] = { &&push_local, &&push_local, &&add, /* ... */ };

push_local: {
  // ...
  void* target = targets[index];
  index += /* ... */;
  goto *target;
}

add: {
  // ...
  void* target = targets[index];
  index += /* ... */;
  goto *target;
}

call_method: {
  // ...
  void* target = targets[index];
  index += /* ... */;
  goto *target;
}
With this interpreter optimization, we do not have the explicit while
loop.
Instead, we have goto
labels for each bytecode handler
and each handler has a separate copy of the dispatch code, that is, the goto jump instruction.
This helps modern processors in at least two ways:
it avoids an extra jump to the top of the loop at the end of each bytecode,
and perhaps more importantly, we have multiple dispatch points instead of a single one.
This is important for branch prediction.
In our first loop, a processor would not be able to predict
where a jump is going, because the switch
normally translates to a single jump
that goes to most, if not all bytecodes.
Though, when we have the jump at the end of a bytecode handler,
we may only see a subset of bytecodes,
which increases the chance that the processor can predict the jump correctly.
Unfortunately, modern processors are rather complex. They have limits for instance for how many jump targets they can remember. They may also end up combining the history for different jump instructions, perhaps because they use an associative cache based on the address of the jump instruction. They have various different caches, including the instruction cache, into which our interpreter loop ideally fits for best performance. And all these things may interact in unexpected ways.
For me, these things make it pretty hard to predict the performance of a bytecode loop for a complex language such as JavaScript or Ruby.
And if it’s too hard to understand, why not try some kind of machine learning. And that’s indeed what Wanhong and Tomoharu came up with.
With all the complexity of modern processors, Wanhong and Tomoharu observed that changing the order of bytecode handlers can make a significant difference for the overall performance of an interpreter. Of course, this will only make a difference if our interpreter indeed spends a significant amount of time in the bytecode loop and the handler dispatch itself.
When looking at various interpreters, we find that most of them use a natural order; by that I mean the bytecode handlers are in the order of the numbers assigned to each bytecode. Other possible orders could be a random order, or perhaps an order based on the frequency of bytecodes or of bytecode sequences. Thus, one might simply put the most frequently used bytecodes first and the less frequently used ones after them, perhaps hoping that the most-used instructions fit into caches, or help the branch predictor in some way.
The goal of our experiments was to find out whether we can use a Genetic Algorithm to find a better ordering and thereby improve interpreter performance. We use a genetic algorithm to create new orders of bytecode handlers by producing a crossover of two existing orderings, with a few handlers additionally reordered, which introduces mutation into the new order. The resulting bytecode handler order is then compiled into a new interpreter, for which we measure the run time of a benchmark. With a Genetic Algorithm, one can thus generate variations of handler orders that, over multiple generations of crossover and mutation, may evolve into a faster handler order.
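To make the crossover and mutation steps a bit more concrete, here is a minimal sketch of how one generation might derive a new handler order from two existing ones. This only illustrates the general idea; the order-crossover operator, the mutation rate, and all names are my assumptions, not the implementation used in the paper:

import random

def crossover(parent_a, parent_b):
    # Order crossover: copy a random slice from parent A, then fill the
    # remaining positions with the handlers in the order of parent B.
    i, j = sorted(random.sample(range(len(parent_a)), 2))
    child = [None] * len(parent_a)
    child[i:j] = parent_a[i:j]
    fill = [h for h in parent_b if h not in child]
    for k in range(len(child)):
        if child[k] is None:
            child[k] = fill.pop(0)
    return child

def mutate(order, rate=0.05):
    # Swap a few randomly chosen handlers to mutate the new order.
    order = list(order)
    for k in range(len(order)):
        if random.random() < rate:
            m = random.randrange(len(order))
            order[k], order[m] = order[m], order[k]
    return order

handlers = ['push_local', 'add', 'call_method', 'jump', 'return']
child = mutate(crossover(handlers, list(reversed(handlers))))
# Each child order would then be compiled into an interpreter, and the
# run time of a benchmark on it serves as the fitness for selection.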
I’ll skip the details here, but please check out the paper below for the specifics.
So, how well does this approach work? To find out, we applied it to eJSVM, a JavaScript interpreter designed for resource-constrained devices.
In the context of resource-constrained embedded devices, it may make sense to tailor an interpreter to a specific application to gain the best performance. Thus, we started by optimizing the interpreter for a specific benchmark on a specific machine. To keep the time needed for the experiments manageable, we used three Intel machines and one Raspberry Pi with an ARM CPU. In many ways, optimizing for a specific benchmark is the best-case scenario, which is only practical if we can deploy a specific application together with the interpreter. Figure 1 shows the results on benchmarks from the Are We Fast Yet benchmark suite. We can see surprisingly large improvements. While the results depend very much on the processor architecture, every single benchmark sees an improvement on all platforms.
Unfortunately, we can’t really know which programs users will run on our interpreters in all scenarios. Thus, we also looked at how interpreter speed changes when we train the interpreter on a single benchmark but run the others. Figure 2 shows how the performance changes when we train the interpreter for a specific benchmark. In the top left corner, we see the results when training for the Bounce benchmark. While Bounce itself sees a 7.5% speedup, the same interpreter speeds up the List benchmark by more than 12%. Training the interpreter on the Permute benchmark, however, gives much smaller improvements for the other benchmarks.
In the paper, we look at a few more aspects including which Genetic Algorithm works best and how portable performance is between architectures.
Reading this blog post, you may wonder how best to go about experimenting with your own interpreter. We also briefly tried optimizing CRuby; unfortunately, we did not yet manage to find the time to continue, but we found a few things that one needs to watch out for when doing so.
First, you may have noticed that we used relatively old versions of GCC. For eJSVM, these gave good results and did not interfere with our reordering. However, on CRuby and with newer GCCs, the compiler starts to reorder basic blocks itself, which makes it harder to get the desired results. Here, flags such as -fno-reorder-blocks or -fno-reorder-blocks-and-partition may be needed.
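For illustration, one would pass these flags to the C compiler along the following lines. This is a hypothetical invocation, not CRuby’s actual build setup, and the file name is just an example:

ruby$ gcc -O2 -fno-reorder-blocks -fno-reorder-blocks-and-partition -c vm.c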
Clang didn’t seem to reorder basic blocks in the interpreter loop.
As a basic test of how big the performance impact might be, I simply ran a handful of random bytecode handler orders, which I would normally expect to show some performance difference, likely a slowdown. For CRuby, however, I did not see a notable performance change, which suggests that its bytecode dispatch may not be worth optimizing further. But it’s a bit early to tell conclusively. We should also give CPython and others a go, but haven’t gotten around to it just yet.
If you care about interpreter performance, it may be worth taking a look at the interpreter loop and seeing whether modern processors deliver better performance when bytecode handlers are reordered.
Our results suggest that it can give large improvements when training for a specific benchmark. There is also still a benefit for other benchmarks that we did not train for, though, it depends on how similar the training benchmark is to the others.
For more details, please read the paper linked below, or reach out on Twitter @smarr.
Abstract
Interpreter performance remains important today. Interpreters are needed in resource constrained systems, and even in systems with just-in-time compilers, they are crucial during warm up. A common form of interpreters is a bytecode interpreter, where the interpreter executes bytecode instructions one by one. Each bytecode is executed by the corresponding bytecode handler.
In this paper, we show that the order of the bytecode handlers in the interpreter source code affects the execution performance of programs on the interpreter. On the basis of this observation, we propose a genetic algorithm (GA) approach to find an approximately optimal order. In our GA approach, we find an order optimized for a specific benchmark program and a specific CPU.
We evaluated the effectiveness of our approach on various models of CPUs including x86 processors and an ARM processor. The order found using GA improved the execution speed of the program for which the order was optimized between 0.8% and 23.0% with 7.7% on average. We also assess the cross-benchmark and cross-machine performance of the GA-found order. Some orders showed good generalizability across benchmarks, speeding up all benchmark programs. However, the solutions do not generalize across different machines, indicating that they are highly specific to a microarchitecture.
@inproceedings{Huang:2023:GA,
  abstract = {Interpreter performance remains important today. Interpreters are needed in resource constrained systems, and even in systems with just-in-time compilers, they are crucial during warm up. A common form of interpreters is a bytecode interpreter, where the interpreter executes bytecode instructions one by one. Each bytecode is executed by the corresponding bytecode handler. In this paper, we show that the order of the bytecode handlers in the interpreter source code affects the execution performance of programs on the interpreter. On the basis of this observation, we propose a genetic algorithm (GA) approach to find an approximately optimal order. In our GA approach, we find an order optimized for a specific benchmark program and a specific CPU. We evaluated the effectiveness of our approach on various models of CPUs including x86 processors and an ARM processor. The order found using GA improved the execution speed of the program for which the order was optimized between 0.8% and 23.0% with 7.7% on average. We also assess the cross-benchmark and cross-machine performance of the GA-found order. Some orders showed good generalizability across benchmarks, speeding up all benchmark programs. However, the solutions do not generalize across different machines, indicating that they are highly specific to a microarchitecture.},
  author = {Huang, Wanhong and Marr, Stefan and Ugawa, Tomoharu},
  blog = {https://stefan-marr.de/2023/06/squeezing-a-little-more-performance-out-of-bytecode-interpreters/},
  booktitle = {The 38th ACM/SIGAPP Symposium on Applied Computing (SAC '23)},
  doi = {10.1145/3555776.3577712},
  isbn = {978-1-4503-9517-5/23/03},
  keywords = {Bytecodes CodeLayout EmbeddedSystems GeneticAlgorithm Interpreter JavaScript MeMyPublication Optimization myown},
  month = mar,
  pages = {10},
  pdf = {https://stefan-marr.de/downloads/acmsac23-huang-et-al-optimizing-the-order-of-bytecode-handlers-in-interpreters-using-a-genetic-algorithm.pdf},
  publisher = {ACM},
  series = {SAC'23},
  title = {{Optimizing the Order of Bytecode Handlers in Interpreters using a Genetic Algorithm}},
  year = {2023},
  month_numeric = {3}
}
As part of her research, Sophie wrote a paper investigating the behavior of method call sites in detail. She looked at how well optimizations such as lookup caches, target duplicate elimination, and splitting apply to modern Ruby code. I’ll use the work here as a foundation and zoom in on the Rails apps we looked at. For all details including the measurement methodology, I’ll defer to sec. 3 of the paper. It also discusses how Sophie instrumented TruffleRuby and how the data was processed.
The benchmarks I am going to be focusing on are called BlogRails, ERubiRails, LiquidCartRender, and LiquidRenderBibs. BlogRails, usually referred to as railsbench, is a small Ruby on Rails application, simulating a basic blog, as created by Rails’ scaffold generator. The benchmark accesses existing blog posts and creates new ones. ERubiRails is a similarly small Rails app; it renders an ERB template from the Discourse project.
I also included two Liquid template language benchmarks here out of curiosity. LiquidCartRender uses Liquid to render an HTML page for a shopping cart. LiquidRenderBibs renders an HTML page with a list of papers that have a variety of different data bits to be shown (specifically this one here).
| Benchmark | Statements | Statement Coverage | Functions | Function Coverage | Calls (in 1000) | Poly. and Megamorphic Calls | Used Call Sites | Poly. and Megamorphic Call Sites |
|---|---|---|---|---|---|---|---|---|
| BlogRails | 118,717 | 48% | 37,595 | 38% | 13,863 | 7.4% | 52,361 | 2.3% |
| ERubiRails | 117,922 | 45% | 37,328 | 35% | 12,309 | 5.4% | 47,794 | 2.3% |
| LiquidCartRender | 23,562 | 39% | 6,269 | 30% | 236 | 5.5% | 3,581 | 2.4% |
| LiquidRenderBibs | 23,277 | 39% | 6,185 | 29% | 385 | 23.4% | 3,466 | 2.8% |
As the table above shows, the Rails benchmarks have about 120,000 Ruby statements each, of which 45-48% are executed. Of the circa 37,500 functions, about 35-38% are executed. In total, the BlogRails benchmark makes about 13,863,000 function calls. 7.4% of these calls are polymorphic or megamorphic.
In Ruby, a call site is considered monomorphic if a single receiver class is seen during execution, which usually also means there’s a single method that is being called. When there is more than one receiver type, we call the call site polymorphic. Once there are more than a certain number of receiver types, a call site is megamorphic. In TruffleRuby, this happens when more than 8 different receiver types were used at the call site. Though, this is a bit of a simplification, and we’ll get into more details in the next section. Looking back at the table, we can observe that ERubiRails seems a bit less polymorphic: only 5.4% of its calls are polymorphic or megamorphic.
The Liquid benchmarks are much smaller, with only about 23,500 statements in about 6,200 functions. The number of calls being between 236,000 and 385,000 is also significantly smaller. Surprisingly, about 23% of all calls in the LiquidRenderBibs benchmark are polymorphic. While I haven’t looked into it in more detail, I would assume that this might be an artifact of the template having to handle a large number of differences in the input data.
Compared to other languages, these numbers do not feel too different. For instance, the DaCapo con Scala project found somewhat similar numbers for Java and Scala. In the Scala benchmarks they looked at, 89.7% of all calls were monomorphic. The Java benchmarks had about 91.5% of all calls being monomorphic.
This means what we see for Rails is roughly in line with what one would expect. This is good news, because it means that the classic optimizations are likely going to work as expected.
But before getting too enthusiastic, let’s dig a little deeper to see whether that is indeed the case.
Let’s take a very simple, Rails-like example as a starting point. The following code shows the ApplicationController defining the status method, which simply returns an HTTP status code. We also define an ArticlesController as subclass of ApplicationController. The ArticlesController implements the index method, which for brevity is kept empty.
class ApplicationController
  def status
    200
  end
end

class ArticlesController < ApplicationController
  def index
  end
end

controllers = [
  ArticlesController.new,
  ApplicationController.new
]
controllers.select { |c| c.status == 200 }
At the end of the example, on line 16, we have an array with both controllers, and select the ones with a status code of 200. The call to the status method is receiver-polymorphic: the call site sees multiple different receiver types, in our case the two controllers. At the same time, the call site is target-monomorphic: there’s only a single method that is activated.
TruffleRuby optimizes this case by using two polymorphic inline caches, or more accurately dispatch chains, one after another, as depicted in fig. 1.
By using two dispatch chains, a language implementation can often turn a receiver-polymorphic call site into a target-monomorphic one. The first dispatch chain acts as a classic lookup cache. It takes the receiver type (since TruffleRuby uses object shapes to optimize objects, the shape serves as a proxy for the receiver type) and caches the method as the result of a lookup. The second chain deduplicates the target methods; in the case of TruffleRuby, it caches Truffle’s call nodes, which implement the method activation, but also optimizations such as splitting.
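To illustrate the idea, here is a minimal, runnable sketch of a call site with two chains. It uses Java’s Class and reflective Method objects as stand-ins for TruffleRuby’s shapes and call nodes; this is not TruffleRuby’s actual implementation.

import java.lang.reflect.Method;

// Sketch of a call site with two dispatch chains: the first caches
// lookups per receiver type, the second deduplicates target methods.
final class TwoChainCallSite {
  record LookupEntry(Class<?> type, Method target, LookupEntry next) {}
  record TargetEntry(Method target, TargetEntry next) {}

  private final String methodName;
  private LookupEntry lookups; // chain 1: receiver type -> method
  private TargetEntry targets; // chain 2: distinct target methods

  TwoChainCallSite(String methodName) { this.methodName = methodName; }

  Object call(Object receiver, Object... args) throws Exception {
    Class<?> type = receiver.getClass();

    Method method = null;
    for (LookupEntry e = lookups; e != null; e = e.next()) {
      if (e.type() == type) { method = e.target(); break; }
    }
    if (method == null) { // slow path: full lookup, then extend chain 1
      method = type.getMethod(methodName);
      lookups = new LookupEntry(type, method, lookups);
    }

    // Chain 2: deduplicated targets; in a real VM each entry would
    // hold a call node rather than invoking reflectively.
    for (TargetEntry t = targets; t != null; t = t.next()) {
      if (t.target().equals(method)) { return method.invoke(receiver, args); }
    }
    targets = new TargetEntry(method, targets);
    return method.invoke(receiver, args);
  }
}

For the controllers example, the first chain ends up with two entries, one per controller class, while the second chain has a single entry, since both receiver types resolve to the same status method.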
Based on our data, eliminating duplicate targets is also an effective optimization for Rails:
| Benchmark | Polymorphic Calls | Megamorphic Calls | Polymorphic (after eliminating duplicate targets) | Megamorphic (after eliminating duplicate targets) |
|---|---|---|---|---|
| BlogRails | 956,515 | 63,319 | -48.8% | -99.1% |
| ERubiRails | 626,535 | 40,699 | -37.4% | -98.6% |
| LiquidCartRender | 12,598 | 280 | -73.3% | -100.0% |
| LiquidRenderBibs | 89,866 | 280 | -73.7% | -100.0% |
The table above gives the absolute number of calls these benchmarks do. As we can see in column two and three there are relatively few megamorphic calls to begin with. In TruffleRuby, a call site is megamorphic when there are more than 8 different receivers or target methods. Megamorphic calls can be a performance issue, especially when class hierarchies are deep and method lookup is costly, because for such megamorphic calls, we cannot cache the lookup results.
The good news is that eliminating duplicate targets is highly effective in avoiding megamorphic calls. As we can see in column five, most calls stop being megamorphic. However, the optimization is much less effective in avoiding polymorphic calls, reducing their number by only 37.4-48.8%. This means that about 50-60% of those calls are still polymorphic.
For a basic interpreter, this isn’t too bad, because we still avoid the overhead of a lookup. However, for TruffleRuby with its just-in-time compilation, this situation is not ideal, because method inlining, i.e., replacing a call by taking the method body and integrating it into the caller during compilation, is limited.
On the positive side, our Liquid render benchmarks benefit nicely here. While I haven’t looked in detail, the number of megamorphic calls being the same suggests that these calls are made in the initial setup and eliminating duplicate targets prevents them from being megamorphic.
TruffleRuby uses an optimization that is not as common in just-in-time compiling systems: method splitting. Most just-in-time compilers rely solely on inlining to enable classic compiler optimizations and get efficient native code. Though, since TruffleRuby builds on the Truffle framework with its metacompilation approach, it tries harder to optimize even before the just-in-time compilation kicks in.
Truffle’s method splitting copies a method in an uninitialized state. Most importantly for us, this means it copies the method without the lookup cache entries, as illustrated in fig. 2. The split method, i.e., the copy, is then associated with a single call site. The idea is that this copy specializes itself in the context of this single caller, which hopefully means method calls are more likely to be monomorphic.
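Conceptually, a split looks roughly like the following sketch. The CallTarget and SyntaxTree types here are hypothetical stand-ins, not Truffle’s actual classes:

// Conceptual sketch of splitting; not Truffle's real API.
interface SyntaxTree { SyntaxTree deepCopy(); }

final class CallTarget {
  private final SyntaxTree uninitializedBody; // no lookup-cache entries yet

  CallTarget(SyntaxTree body) { this.uninitializedBody = body; }

  // A split is a fresh, uninitialized copy of the method that is owned
  // by a single call site, so its caches specialize to that one caller.
  CallTarget split() {
    return new CallTarget(uninitializedBody.deepCopy());
  }
}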
So, is splitting succeeding at monomorphizing call sites? Let’s look at the data. Note, we already eliminated duplicate targets. Thus, the numbers are a little smaller here.
| Benchmark | Polymorphic Calls (w/o duplicate targets) | Megamorphic Calls (w/o duplicate targets) | Polymorphic (after splitting) | Megamorphic (after splitting) |
|---|---|---|---|---|
| BlogRails | 490,072 | 557 | -100% | -100% |
| ERubiRails | 391,997 | 553 | -100% | -100% |
| LiquidCartRender | 2,000 | 0 | -100% | n/a |
| LiquidRenderBibs | 23,633 | 0 | -100% | n/a |
Indeed, splitting is highly effective in turning polymorphic and megamorphic calls into monomorphic calls, which allows the just-in-time compiler to aggressively inline and optimize the Ruby code.
As we have seen in the last table, the polymorphic and megamorphic calls were all monomorphized. Though, let’s take a slightly different look at the data. Instead of looking at the run-time calls, let’s look at how many targets there are at a call site.
| Benchmark | Max. Targets Before Optimizations | Max. Targets After Eliminating Duplicate Targets | Max. Targets After Splitting |
|---|---|---|---|
| BlogRails | 206 | 24 | 2 |
| ERubiRails | 206 | 24 | 2 |
| LiquidCartRender | 20 | 5 | 1 |
| LiquidRenderBibs | 20 | 7 | 1 |
From this table, we can see that the Rails benchmarks have at least one call site with 206 different receiver types. After eliminating duplicate target methods, we see at most 24 different targets. Adding splitting to the system reduces this further to at most 2 entries. As we saw from the number of run-time calls, these optimizations in combination are indeed highly effective for Rails applications.
From this brief look, we can conclude that despite these optimizations having been invented some 30-40 years ago, they are still highly effective even for today’s dynamic Ruby on Rails systems.
In our paper, we go into many more details. We also look at how blocks behave (spoiler: they turn out to be slightly more polymorphic than methods). We investigate how lookup caches evolve over time and find patterns that may help us to improve performance further in the future. We also noticed that TruffleRuby’s splitting is a little too enthusiastic. For instance, blocks/Procs were always split, which has been fixed already. There is more work to be done to see whether splitting can further be reduced to avoid redundant work at run time. Though, that’s for another day.
Meanwhile, please give the paper a read, attend our presentation at DLS, and find us with questions, comments, and suggestions on Twitter @SophieKaleba and @smarr.
Abstract
Applications written in dynamic languages are becoming larger and larger and companies increasingly use multi-million line codebases in production. At the same time, dynamic languages rely heavily on dynamic optimizations, particularly those that reduce the overhead of method calls.
In this work, we study the call-site behavior of Ruby benchmarks that are being used to guide the development of upcoming Ruby implementations such as TruffleRuby and YJIT. We study the interaction of call-site lookup caches, method splitting, and elimination of duplicate call-targets.
We find that these optimizations are indeed highly effective on both smaller and large benchmarks, methods and closures alike, and help to open up opportunities for further optimizations such as inlining. However, we show that TruffleRuby’s splitting may be applied too aggressively on already-monomorphic call-sites, coming at a run-time cost. We also find three distinct patterns in the evolution of call-site behavior over time, which may help to guide novel optimizations. We believe that our results may support language implementers in optimizing runtime systems for large codebases built in dynamic languages.
@inproceedings{Kaleba:2022:CallSites,
  abstract = {Applications written in dynamic languages are becoming larger and larger and companies increasingly use multi-million line codebases in production. At the same time, dynamic languages rely heavily on dynamic optimizations, particularly those that reduce the overhead of method calls. In this work, we study the call-site behavior of Ruby benchmarks that are being used to guide the development of upcoming Ruby implementations such as TruffleRuby and YJIT. We study the interaction of call-site lookup caches, method splitting, and elimination of duplicate call-targets. We find that these optimizations are indeed highly effective on both smaller and large benchmarks, methods and closures alike, and help to open up opportunities for further optimizations such as inlining. However, we show that TruffleRuby's splitting may be applied too aggressively on already-monomorphic call-sites, coming at a run-time cost. We also find three distinct patterns in the evolution of call-site behavior over time, which may help to guide novel optimizations. We believe that our results may support language implementers in optimizing runtime systems for large codebases built in dynamic languages.},
  acceptancerate = {0.4},
  author = {Kaleba, Sophie and Larose, Octave and Jones, Richard and Marr, Stefan},
  blog = {https://stefan-marr.de/2022/11/how-effective-are-classic-lookup-optimizations-for-rails-apps/},
  booktitle = {Proceedings of the 18th Symposium on Dynamic Languages},
  day = {7},
  doi = {10.1145/3563834.3567538},
  keywords = {Analysis CallSite DynamicLanguages Inlining LookupCache MeMyPublication Splitting myown},
  location = {Auckland, New Zealand},
  month = dec,
  note = {(acceptance rate 40%)},
  pages = {14},
  pdf = {https://stefan-marr.de/downloads/dls22-kaleba-et-al-analyzing-the-run-time-call-site-behavior-of-ruby-applications.pdf},
  publisher = {ACM},
  series = {DLS'22},
  title = {Who You Gonna Call: Analyzing the Run-time Call-Site Behavior of Ruby Applications},
  year = {2022},
  month_numeric = {12}
}
Thanks to Sophie, Octave, and Chris Seaton for suggestions and corrections on this blog post.
In languages like JavaScript, Python, and Ruby, objects are much more flexible than in many other languages including Java, C++, or C#, because fields can be added and possibly even removed dynamically as needed.
I’ll use JavaScript for our examples. Let’s imagine we are working with a sensor that can determine location and movement information. Not all data may be available at every point when we access the sensor. Indeed, most likely we may just have the current longitude and latitude. Perhaps sometimes, we have access to precise GPS coordinates, which also give us altitude. In even more rare cases, the sensor might even be moving, which gives us a bearing and speed. If we imagine this in code, we might get something that looks vaguely like the following code.
let location = {}
location.longitude = 51.28; // getLong()
location.latitude = 1.08;   // getLat()

if (hasAltitude()) {
  location.altitude = getAltitude();
}

if (isMoving()) {
  location.bearing = getBearing();
  location.speed = getSpeed();
}
Thus, when we access the sensor, we create a new JavaScript object and then add the longitude and latitude fields. Depending on the available data, we may also add the field for altitude, as well as bearing and speed. However, if the data is not available, our JavaScript object will only have longitude and latitude.
In addition to adding fields arbitrarily, in JavaScript, we can even delete them. So, how do our modern language implementations implement this efficiently?
The most direct way to implement objects, where one can add and remove fields arbitrarily, would probably be some kind of hash table or a list of field names and their values. However, neither of these two approaches is as efficient as directly accessing a known memory offset for a specific field, as is possible in languages with less flexible objects.
To gain the same efficiency in dynamic languages, hidden classes were invented. The key idea is that a language implementation can determine at run time the structure of objects, create a kind of map or hidden class that can tell us which fields an object has, possibly even the types stored in a field, and most importantly where the field can be found in memory. This works well because most code is much less dynamic than what the language would allow for.
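As a rough sketch of the mechanism, ignoring field deletion and many VM details, a hidden class can be thought of as a table from field names to storage indexes, plus transition edges that let the VM reuse hidden classes when the same fields are added in the same order. The class names below are made up for illustration, and the sketch is in Java rather than inside a real VM:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Sketch of a hidden class: field name -> storage index, plus
// transition edges for "add this field next".
final class HiddenClass {
  final Map<String, Integer> fieldOffsets = new HashMap<>();
  final Map<String, HiddenClass> transitions = new HashMap<>();

  // Returns the hidden class resulting from adding `field`,
  // reusing an existing transition when one exists.
  HiddenClass addField(String field) {
    return transitions.computeIfAbsent(field, f -> {
      HiddenClass next = new HiddenClass();
      next.fieldOffsets.putAll(fieldOffsets);
      next.fieldOffsets.put(f, fieldOffsets.size());
      return next;
    });
  }
}

final class DynObject {
  HiddenClass klass = new HiddenClass(); // a VM would share one root
  Object[] values = new Object[0];       // field values, by index

  void set(String field, Object value) {
    Integer offset = klass.fieldOffsets.get(field);
    if (offset == null) {      // new field: transition the hidden class
      klass = klass.addField(field);
      values = Arrays.copyOf(values, values.length + 1);
      offset = klass.fieldOffsets.get(field);
    }
    values[offset] = value;
  }
}

With such a scheme, two location objects that both get only longitude and latitude end up sharing the same final hidden class, which is exactly what makes the approach pay off.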
For our example code from above, fig. 1 shows how this might look in an implementation. We start out with our location object being empty. The object only contains a pointer to an empty hidden class. Once we execute the code on line 2, the longitude field is added. We store the value 51.28 into an array that we use to store all field values. Since it’s the first field, it’s stored at index 0, and we record this in the hidden class. However, we really want to be able to reuse hidden classes easily. So, instead of changing the existing hidden class, we create a new one, which records that longitude is stored at index 0. Basically the same happens when we execute line 3 and add the latitude to the object. We need to expand the array by one slot to hold the value 1.08, and create a new hidden class that records that latitude is stored at index 1.
The next time we execute those lines, we don’t need to create new hidden classes but can look them up based on the fields that we are adding.
Though, so far, we only looked at the first three lines of code. Lines 6, 10, and 11 add more fields, but do so conditionally. Focusing only on the graph of hidden classes, fig. 2 shows what that would look like.
The first three hidden classes are the same as before.
In the case that we have neither an altitude nor a movement, we simply stay in the third hidden class. However, if we have any of these additional details, the so-called hidden class graph evolves further. In this particular case, we would even introduce a branch depending on whether we have the altitude details. If we first have the altitude, in the simplest kind of hidden class graph, we would end up with bearing and speed being stored at different indexes in the object.
In his previous work on the eJSVM, Tomoharu already relied on JavaScript programs showing fairly stable patterns in their behavior. This means the differences caused by user input and between different runs of a program are relatively minor, as long as one observes a good sample of program executions. Thus, we came up with the idea of using a classic profile-guided optimization approach to optimize the hidden class graph of an application.
The basic idea is that during execution of a representative set of runs, we can use the garbage collector to gather statistics about the kind of objects used in a program, their hidden classes, and which of the hidden classes are used most. With this information, we can optimize the hidden class graph of a program for future executions.
Specifically, we apply the following optimizations:

1. moving branches in the hidden class graph and merging branches that become identical,
2. eliminating “temporary” hidden classes that are rarely used, and
3. merging identical parts of the hidden class graph created by different code paths.
Let’s go through these optimizations step by step.
For our example, let’s assume we profiled a number of representative executions and found that indeed altitude and movement are rarely available. Thus, most objects will only have longitude and latitude fields. Furthermore, our data also tells us that it basically never happens that we have movement information without also having the altitude. This means we can apply our first optimization and “move” the branch for adding altitude in the hidden class graph. Figure 3 highlights the change in red.
By moving the branch for adding altitude to the hidden class after adding bearing and speed, we can merge the two branches, since they are now identical. This means the dotted hidden classes in the figure can be dropped. In the paper, we go into a bit more detail on how to make this correct without changing JavaScript semantics.
As the second optimization, we eliminate “temporary” hidden classes, which are rarely used over longer time spans. The prime example of these temporary hidden classes are the ones that are only used between adding fields one after another. As highlighted with dotted lines in fig. 4, the hidden class between adding longitude and latitude, as well as the ones between adding bearing, speed, and finally altitude, can be removed. In the end, this leaves only three hidden classes in our graph.
The third optimization is relevant for larger programs and not directly visible in our example. However, often different code paths would create the same kind of objects. Thus, the hidden classes would be basically the same, which allows us to merge these identical parts of the hidden class graph.
For the evaluation, we used a variation of the Are We Fast Yet benchmarks. Here, I’ll just quickly look at the memory savings.
For these measurements, we first collected the profiling information for each benchmark, optimized the hidden class graph, and then used the result to guide the execution.
As can be seen in fig. 5, the optimized hidden class graphs are quite a bit smaller. We reduce their memory use by about 62% on average.
By reducing the overall number of hidden classes in the system, we also noticed a few speedups in our benchmarks. Figure 6 shows that the reduction in hidden classes reduces the cache misses for eJSVM’s single-entry lookup caches, especially for larger benchmarks such as CD and Havlak. However, there are other effects, too. For instance, with the hidden classes known up front, we can size objects correctly on allocation, avoid the extra array to store the fields, and reduce the number of times the array needs to be expanded because of frequent transitions.
In the paper, we discuss a lot more details, background, and evaluation. Of course, there are various corner cases to be considered, for example things like JavaScript’s prototype objects, how to make sure that the JavaScript semantics don’t break even though the hidden classes change, and a few other bits.
Please give the paper a read, attend our presentation at VMIL’22, and find some of us for questions, comments, and suggestions on Twitter @profrejones and @smarr.
Abstract
JavaScript is increasingly used for the Internet of Things (IoT) on embedded systems. However, JavaScript’s memory footprint is a challenge, because normal JavaScript virtual machines (VMs) do not fit into the small memory of IoT devices. In part this is because a significant amount of memory is used by hidden classes, which are used to represent JavaScript’s dynamic objects efficiently.
In this research, we optimize the hidden class graph to minimize their memory use. Our solution collects the hidden class graph and related information for an application in a profiling run, and optimizes the graph offline. We reduce the number of hidden classes by avoiding introducing intermediate ones, for instance when properties are added one after another. Our optimizations allow the VM to assign the most likely final hidden class to an object at its creation. They also minimize re-allocation of storage for property values, and reduce the polymorphism of inline caches.
We implemented these optimizations in a JavaScript VM, eJSVM, and found that offline optimization can eliminate 61.9% of the hidden classes on average. It also improves execution speed by minimizing the number of hidden class transitions for an object and reducing inline cache misses.
@inproceedings{Ugawa:2022:HCGOpt,
  abstract = {JavaScript is increasingly used for the Internet of Things (IoT) on embedded systems. However, JavaScript's memory footprint is a challenge, because normal JavaScript virtual machines (VMs) do not fit into the small memory of IoT devices. In part this is because a significant amount of memory is used by hidden classes, which are used to represent JavaScript's dynamic objects efficiently. In this research, we optimize the hidden class graph to minimize their memory use. Our solution collects the hidden class graph and related information for an application in a profiling run, and optimizes the graph offline. We reduce the number of hidden classes by avoiding introducing intermediate ones, for instance when properties are added one after another. Our optimizations allow the VM to assign the most likely final hidden class to an object at its creation. They also minimize re-allocation of storage for property values, and reduce the polymorphism of inline caches. We implemented these optimizations in a JavaScript VM, eJSVM, and found that offline optimization can eliminate 61.9% of the hidden classes on average. It also improves execution speed by minimizing the number of hidden class transitions for an object and reducing inline cache misses.},
  author = {Ugawa, Tomoharu and Marr, Stefan and Jones, Richard},
  booktitle = {Proceedings of the 14th ACM SIGPLAN International Workshop on Virtual Machines and Intermediate Languages},
  day = {5},
  doi = {10.1145/3563838.3567678},
  keywords = {EmbeddedSystems HiddenClasses InlineCaching IoT JavaScript MeMyPublication OfflineOptimization VirtualMachine myown},
  location = {Auckland, New Zealand},
  month = dec,
  pages = {11},
  pdf = {https://stefan-marr.de/downloads/vmil22-ugawa-et-al-profile-guided-offline-optimization-of-hidden-class-graphs.pdf},
  publisher = {ACM},
  series = {VMIL'22},
  title = {Profile Guided Offline Optimization of Hidden Class Graphs for JavaScript VMs in Embedded Systems},
  year = {2022},
  month_numeric = {12}
}
Thanks to Tomoharu and Richard for suggestions and corrections on this blog post.
Normally, I’d say: please do not try this at home. Though, this time, it’s more appropriate:
Warning: Do not do this in production!
It’s fine to do it at home and for curiosity, but we are all better off when our programming languages are safe.
For context, with the increased interest in Rust, people are arguing again about all kind of programming languages and their performance characteristics with respect to safety. One of the widely used ones is Java of course, which I am using for my research.
With GraalVM’s native image at hand, it seemed like a good opportunity to assess roughly how much performance goes into some of Java’s memory safety features. I am using native image ahead-of-time compilation here, because JVMs like HotSpot can do all kind of nice tricks to remove more overhead at run time than other statically compiled languages can. To level the playing field, let’s stick to static ahead-of-time compilation for this experiment.
Let’s start with a few examples of how Java ensures memory safety.
When accessing an object in Java, the runtime first makes sure that the object is an actual object and not null. This means, when the following code tries to access the Element object e, we have an implicit null check, which prevents us from accessing arbitrary memory.
public Object testImplicitIsNull(Element e) {
  return e.getValue();
}
To illustrate this, here the relevant part of the compiler graph:
We can see an If node, which is not in our Java code. This If is the implicit null check. In the graph, we also see an IsNull node that accesses the method’s parameter 1, indicated as P(1) in the graph. The true branch on the left of the If leads to a BytecodeException node, which in later compiler phases is mapped to a classic NullPointerException.
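In other words, the compiled code behaves roughly as if we had written the check ourselves. A sketch of the equivalent Java:

public Object testImplicitIsNull(Element e) {
  if (e == null) {             // the implicit check made explicit
    throw new NullPointerException();
  }
  return e.getValue();         // e is known to be a valid object here
}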
When accessing arrays, Java similarly checks that we are not accessing the array outside of its boundaries. Again, trying to make sure that we do not access arbitrary memory. The following method is thus compiled to include an implicit bounds check, too:
public Object boundsCheck(Object[] array, int index) {
  return array[index];
}
This time, let’s look at the complete compiler graph for the method:
Reading it from the top, we notice again the null check. And indeed, since arrays are objects, we of course also need to check that we have a proper object first. A little further down, we see the ArrayLength node, which reads the length of the array, and then a |<| node, which checks that the second parameter to our method, our index, is smaller than the array length but not smaller than zero. Only if that’s the case do we read from the array with the LoadIndexed node. For the case that we have an out-of-bounds index, the compiler turns the BytecodeException node into an ArrayIndexOutOfBoundsException.
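Again, the graph corresponds roughly to the following explicitly checked Java:

public Object boundsCheck(Object[] array, int index) {
  if (array == null) {                        // implicit null check
    throw new NullPointerException();
  }
  if (index < 0 || index >= array.length) {   // implicit bounds check
    throw new ArrayIndexOutOfBoundsException(index);
  }
  return array[index];
}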
Other safety features I won’t go into include, for instance, casts causing a ClassCastException to avoid unsafe accesses.
For this experiment, I implemented two new compiler phases for Graal. Both essentially look for BytecodeException nodes and remove these exception branches from the compiler graphs. In the Graal compiler, this is relatively straightforward. We can find the If node from the BytecodeException node and set its condition to always be either true or false, depending on which branch the BytecodeException node was in, ensuring that the branch can never be activated. Graal can then remove this branch, as it is not reachable anymore.
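The core of such a phase might look roughly like the following sketch. I am writing the node-API names from memory here, so treat class and method names as approximations rather than a drop-in implementation:

import org.graalvm.compiler.graph.Node;
import org.graalvm.compiler.nodes.*;
import org.graalvm.compiler.nodes.extended.BytecodeExceptionNode;
import org.graalvm.compiler.phases.Phase;

// Sketch of a phase that makes exception branches unreachable.
// API names approximate Graal's node API and may not match exactly.
public class RemoveSafetyChecksPhase extends Phase {
  @Override
  protected void run(StructuredGraph graph) {
    for (BytecodeExceptionNode exception :
         graph.getNodes().filter(BytecodeExceptionNode.class)) {
      // The exception branch starts with a begin node below an If.
      Node begin = exception.predecessor();
      if (begin instanceof AbstractBeginNode
          && begin.predecessor() instanceof IfNode ifNode) {
        boolean exceptionOnTrue = ifNode.trueSuccessor() == begin;
        // Pin the condition so the exception branch can never be taken;
        // canonicalization then removes the now-dead branch.
        ifNode.setCondition(exceptionOnTrue
            ? LogicConstantNode.contradiction(graph)
            : LogicConstantNode.tautology(graph));
      }
    }
  }
}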
I used two phases so that the first phase catches the exception branches very early in the compilation process, and the second essentially does the same thing again after some higher-level nodes have been lowered into groups of nodes, which may again contain these safety checks.
At the moment, I don’t distinguish different types of bytecode exceptions, of which there are a few, which means, the result is indeed inherently unsafe, and even changes Java semantics, possibly breaking programs. Some Java code may, and in practice indeed does, depend on these implicit exceptions. Thus, the compiler phases are only applied to application code for which I know that it is safe.
The code can be found here: graal:unsafe-java.
The first set of experiments is a set of benchmarks that were designed to assess cross-language performance, the Are We Fast Yet benchmarks. They contain 9 microbenchmarks, and 5 slightly larger ones to cover various common language features of object-oriented languages.
On some of these benchmarks, it doesn’t make any difference at all, since the benchmarks are dominated by other aspects. Though, even for larger benchmarks, it can reduce run time by 3-10%.
| Benchmark | Run time change |
|---|---|
| CD | -4% |
| DeltaBlue | -10% |
| Havlak | -3% |
| Json | -5% |
| Richards | -10% |
| Bounce | -7% |
| List | -3% |
| Mandelbrot | 0% |
| NBody | 0% |
| Permute | -4% |
| Queens | -28% |
| Sieve | -22% |
| Storage | 0% |
| Towers | -5% |
To emphasize: these benchmarks are designed to compare performance across languages, and to identify possible compiler optimizations. They are very likely not going to translate directly to the application you may care about.
For the full results, see ReBenchDB:are-we-fast-yet.
For me, there are a few “applications” I care deeply about at the moment. My research focuses on speeding up interpreters, especially in the GraalVM/Truffle ecosystems.
TruffleSOM, a research language, has two different types of interpreters: a Truffle-style abstract syntax tree (AST) interpreter, as well as an interpreter based on bytecodes.
For the AST interpreter, focusing just on the larger benchmarks, we see the following results:
| Benchmark | Run time change |
|---|---|
| DeltaBlue | -5% |
| GraphSearch | -7% |
| Json | -5% |
| NBody | -5% |
| PageRank | -8% |
| Richards | -6% |
So, the changes are in the 5-8% range. For an interpreter, that is a significant improvement. Typically any improvement here is hard won. (Full Results: TruffleSOM: AST interp.)
Let’s also look at the bytecode interpreter:
| Benchmark | Run time change |
|---|---|
| DeltaBlue | -8% |
| GraphSearch | -12% |
| Json | -10% |
| NBody | -14% |
| PageRank | -16% |
| Richards | -14% |
Here, we see a higher benefit, in the 8-16% range. This is likely because of the access to the bytecode array, which now doesn’t do the bounds check any longer. Since this is a very common operation, it gives a good extra bonus. (Full Results: TruffleSOM: BC interp.)
Since TruffleSOM is a research language, fairly small in comparison to others, let’s have a look at the impact on TruffleRuby, which is a mature implementation of Ruby, a quite a bit more complex language.
Here, I cherry-picked a couple of larger Ruby applications, just to keep it brief:
| Benchmark | Run time change |
|---|---|
| AsciidoctorConvertTiny | -5% |
| AsciidoctorLoadFileTiny | -5% |
| ActiveSupport | -10% |
| LiquidCartParse | -10% |
| LiquidCartRender | -1% |
| LiquidMiddleware | -17% |
| LiquidParseAll | -9% |
| LiquidRenderBibs | -7% |
| OptCarrot | -9% |
Here we see a range of 1-17% improvements, but the measurements are somewhat noisy. (Full Results: ReBenchDB: TruffleRuby)
The big question is now: are our interpreters correct enough to do away with the extra safety layer? It’s likely not a good idea, no…
For questions or comments, please find me on Twitter.