Stefan-Marr.de

5 Reasons Why Box Plots are the Better Default Choice for Visualizing Performance

2024-06-18T08:38:38+01:00

Box Plots, Or Better!

This post is motivated by discussions I have been having for, ehm, forever?

To encourage others to use good research practices and avoid bar charts, I’ll argue that people should use box plots as their go-to choice when presenting performance results. Of course, box plots aren’t a one-size-fits-all solution. However, I believe they should be the preferred choice for many standard situations. For some situations, more appropriate chart types should be chosen based on careful consideration.

Thus, box plots should be the default choice instead of the omnipresent bar chart. Or, in short: Box Plots, or Better!

When working on performance, I usually work with just-in-time compiling language runtimes, on which I would run various experiments that I want to compare. For examples, check the papers of Humphrey, Octave, and Sophie (copies are here). However, I believe the argument applies more generally beyond our own work.

Reason 1: Performance Measurements Are Samples from a Distribution

When we measure the performance of a system, we usually get a data point that has been influenced by many different factors. This is independent of whether we measure wall-clock time, the number of executed instructions, or perhaps memory. While we can control some factors and influence others, today’s systems are often too complex for us to fully understand them. For example, cache effects, thermal properties, as well as hard- and software interactions outside our control can change performance non-deterministically. In practice, we therefore often treat the system as a black box. (I’d encourage people to dig deeper, but I’m aware that time does not always allow for it.) Treating it as a black box then of course requires us to repeat our experiments multiple times to be able to characterize the range of results that are to be expected. Statisticians would perhaps describe our measuring as “sampling a distribution”.

And this is the point where box plots come in. They are designed to be a convenient way to characterize distributions. Let’s assume we have an experiment A and B, and we have taken 50 measurements each. Figure 1 shows the results of our experiments as box plots.

Figure 1: Box plot comparing A and B,
including annotations for the key elements of a box plot.

I annotated the box plot for A with some key elements, including the median, 25th, and 75th percentile. We also see the interquartile range, which tells us a bit about the shape of the result distribution, and the outliers, i.e., typically all measurements that lie more than 1.5x the interquartile range below the 25th percentile or above the 75th percentile.
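To make these elements concrete, here is a minimal sketch that computes them with Python's standard library. The function name, the sample values, and the choice of the "inclusive" quantile method are my assumptions for illustration; plotting libraries may use slightly different quantile definitions, and this is not the data behind Figure 1.

```python
import statistics


def box_plot_stats(samples):
    """Return the statistics a typical (Tukey-style) box plot draws."""
    data = sorted(samples)
    # quantiles(n=4) returns the three cut points: Q1, median, Q3
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme samples that are still inside the fences.
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [x for x in data if x < lower_fence or x > upper_fence],
    }


# Hypothetical run times in seconds, for illustration only.
example = [10.1, 10.4, 10.2, 10.9, 11.3, 10.6, 10.3, 19.0, 10.5, 10.8]
for name, value in box_plot_stats(example).items():
    print(f"{name}: {value}")
```

Everything outside the fences is reported as an outlier, which matches the 1.5x interquartile range convention described above.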

Wikipedia has a good overview of box plots that also goes deeper.

Reason 2: Allows Detailed Visual Comparison

With box plots, we have enough details to see that the two experiments behave differently in a number of ways.

The median lines tell us that A is usually faster than B. However, we also see that A is not always faster than B, because A’s results are spread out further. In the worst case, A takes 19 seconds, which is more than B’s worst case of 15 seconds. While the main half of all data points for both experiments don’t overlap, we see that a good chunk of A’s results still fall within what’s often not considered to be outliers, i.e., the range between the 75th percentile and the 75th percentile plus 1.5x the interquartile range.

By looking at the figure and comparing these plots, I believe we can get a reasonable intuition of the performance tradeoffs of the two options.

Reason 3: Box Plots Give Enough Details

The above analysis of the results would not have been possible for instance with a classic bar chart as shown in Figure 2.

Figure 2: Bar chart comparing A and B, showing the mean and the standard deviation as error bars.

Bar charts are often used to compare the performance of two or more systems or experiments. However, they show only two values per bar, typically a chosen “measure of centrality” and some form of “error”. Very common choices here are the arithmetic mean, geometric mean, harmonic mean, and perhaps the median. Each of these has different properties, and one has to carefully think about which one to use based on the type of data one is working with (or perhaps not). At this point one also still has to choose how to characterize measurement errors.

This means that bar charts are less standardized than box plots, and one has to be explicit about what is shown.
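As a small illustration of how much the choice of statistic matters, the following sketch computes the measures mentioned above for one made-up set of run times with a single slow outlier. The sample values are hypothetical and only meant to show that the numbers a bar would display can differ noticeably.

```python
import statistics

# Hypothetical run times in seconds with one slow outlier.
samples = [10.0, 10.5, 11.0, 11.5, 30.0]

summary = {
    "arithmetic mean": statistics.mean(samples),
    "geometric mean": statistics.geometric_mean(samples),
    "harmonic mean": statistics.harmonic_mean(samples),
    "median": statistics.median(samples),
    "standard deviation": statistics.stdev(samples),
}

for name, value in summary.items():
    print(f"{name:>18}: {value:.2f}")
```

For positive data, the harmonic mean is never larger than the geometric mean, which is never larger than the arithmetic mean, so the same bar can look better or worse depending only on which one is picked.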

Figure 3: Bar chart comparing A and B, showing the median and 25th and 75th percentile.

To just give one example, Figure 3 is the same data but shows the median and the 25th and 75th percentile instead of the mean and standard deviation.

Since we show different sets of statistics, our impression of the results somewhat changes. Of course, this is the power of visualization and picking statistics. We can draw attention to specific aspects of the data. Figure 3 would lead me to conclude that A is always better than B, while Figure 2 would make me wonder what the underlying data looks like to understand how we got to the depicted standard deviation.

Compared to our box plot in Figure 1, the choice of statistics to show, and the reduced number of details we see here can result in misleading others and ourselves. Thus, I’d strongly argue that bar charts are neither a good default to represent data during data analysis, nor when presenting the final insights in a paper. They show too few details, oversimplifying an often more complex story.

Reason 4: Box Plots Don’t Overwhelm With Details

Of course, we could also go in the other direction and choose a plot type that shows much more detail.

Let’s start with Figure 4, which shows a violin plot. I selected here a version that shows just the density distribution of our results. One could go and highlight specific statistics on it for clarity of course. However, just looking at Figure 4, we get a more detailed look at how our measurements are distributed. From this, we see very clearly that B’s results are grouped much tighter together, and at each end, i.e., at 9 and 15 seconds, there are outliers. A on the other hand, is much more stretched out, though, a good chunk of the results are indeed roughly in the area indicated by the box plot previously. Though, what we see here also is that the area is wider and stretches from perhaps 8 to 15, only outside of which we likely have significantly fewer samples. We did not see these details on Figure 1.

Figure 4: A violin plot to compare A and B showing the density distribution of the results.

For data analysis, this way of looking at the data is very helpful, because it allows us to see the underlying distribution. For reporting data in a paper, this might however be too detailed, in the sense that it is not as easily interpretable visually and makes drawing conclusions harder.

Figure 5: A combination of violin, box plot, and raw data. The mean is indicated as a red dot.

While not ideal for final reports, violin plots are useful during analysis. Perhaps one wants to go even a step further and use a combination of violin and box plot together with the raw data and the mean during analysis. An example of this is shown in Figure 5. While the plot is very busy and not suitable for a paper, I’d think, it prevents us from jumping to conclusions based on data summaries.

If you’re analyzing your data in R, a package like ggstatsplot might be a good solution.

Reason 5: They Are Very Versatile

Box plots can be used for many different purposes, independent of the type of distribution of data one wants to visualize, for different types of experiments, and to represent experimental data, as well as data summaries.

Because box plots visualize selected “percentile statistics”, we can use them without having to adapt them for specific experiments or types of distributions. They are nonparametric, i.e., one does not have to select any parameters for specific input data. This is useful for performance evaluations, because we do not generally know what type of distribution we are dealing with, and samples are not generally independent, which makes the use of various other statistical tools more complicated.

Figure 6: Comparing experiments A, B, and C. Their data is drawn from different distributions, none of which are normal distributions. The density plot at the bottom characterizes the sample distributions more precisely.

Figure 6 shows box plots for three different distributions. Important here is that none of these experiments give normally distributed data. Nonetheless, we can use box plots to describe them more abstractly and see certain key details such as A being skewed to the left, B slightly less so but much more narrow, and C having outliers to the left, with a small skew to the right.

So far, I have used examples where there was an experiment A and B, and perhaps C. Though, often we may want to understand the relation of perhaps two variables. This might be in the sense of scaling a computation over multiple processor cores. Figure 7 shows a box plot that visualizes data for such a hypothetical experiment.

Figure 7: Comparing A and B as they change for increasing values of a second variable from 1 to 20.

While one would often use line charts for such scaling experiments, box plots can be used here as well. One can still see the rough shape of a line, but we do not lose sight of the distribution of our experimental data. Arguably, Figure 7 is very busy though, and a line chart with a confidence interval or similar would look better (Python, ggplot).

We can also see that box plots “scale” reasonably well themselves in the sense that they work for data that is spread out as well as for data that is very closely grouped. For example for B, we see the values at x-axis point 1 to be very narrowly together. Similarly for A, we see at 20 that data is tightly grouped. In either case, we still have the complete power of the box plot and can draw conclusions.

If we would now want to summarize these results, we can of course use box plots!

Figure 8: Summary of the data of Figure 7. A box plot of a box plot. In the upper half, it's a box plot over the medians. In the lower half, it's a box plot over all raw data.

Figure 8 shows a summary. The plot at the top uses the medians for each of the experiments over the variable that went from 1 to 20. So, for A and B, we have 20 values each, and plot them as box plots. Note, for this to be a valid statistic, technically the medians have to be derived from independent samples, so, you may need to consult your friendly neighborhood statistician.

In the bottom plot, I used the raw data of all experiments. In a way this still “works”, and results in very similar box plots in this case. Though, here the meaning changes and of course whether you can do this with your data is something you need to ask your statistician about. I think, common wisdom in our field is to first normalize the data and then “bootstrap” it. This would give us bootstrapped medians etc. The median is then technically from a normal distribution of independent samples, and standard statistics are legal again.
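A minimal sketch of what bootstrapping the median could look like, assuming the data has already been normalized against some baseline. The function names, the number of resamples, the fixed seed, and the simple percentile interval are all my assumptions; whether this scheme is appropriate for a given data set is still a question for a statistician.

```python
import random
import statistics


def bootstrap_medians(samples, resamples=1000, seed=42):
    """Resample with replacement and collect the median of each resample."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(samples)
    return [
        statistics.median(rng.choices(samples, k=n))
        for _ in range(resamples)
    ]


def percentile_interval(values, confidence=0.95):
    """Simple percentile interval over the bootstrapped statistics."""
    ordered = sorted(values)
    tail = (1.0 - confidence) / 2.0
    low = ordered[int(tail * (len(ordered) - 1))]
    high = ordered[int((1.0 - tail) * (len(ordered) - 1))]
    return low, high


# Hypothetical normalized run times (run time divided by a baseline).
normalized = [0.91, 0.95, 0.97, 1.02, 0.93, 0.96, 1.10, 0.94, 0.98, 0.92]
medians = bootstrap_medians(normalized)
low, high = percentile_interval(medians)
print(f"median of bootstrapped medians: {statistics.median(medians):.3f}")
print(f"95% percentile interval: [{low:.3f}, {high:.3f}]")
```

The list of bootstrapped medians can itself be summarized as a box plot, which keeps the presentation consistent with the rest of the figures.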

Conclusion: Box Plots Answer Important Questions At A Glance. Use Box Plots, Or Better!

When it comes to writing academic papers, I do believe that box plots are a much better default choice for communicating performance results than bar charts are.

The key reasons for me are:

they are a concise representation of the result distribution
they allow a visual comparison of more than the most basic statistics
and thus, answer more questions than bar charts
but without making things too complicated
they are also more standardized, and thus, remain more readable when taken out of context
and they can be used sensibly for a wide range of use cases

So, for me box plots strike a good overall balance, which makes them a good standard choice for papers.

Though, as mentioned earlier, they are not a universally best choice either. For data analysis, one would want more details, and for specific use cases or types of data distributions, e.g., bi- or multi-modal distribution, other types of plots are more suitable. I can recommend this piece with many examples where other types of plots than box plots may be better choices.

For questions, comments, or suggestions, please find me on Twitter @smarr or Mastodon.

Why Are My Bytecode Interpreters Slow? Hunting Truffles with VTune

2024-02-23T12:27:55+00:00

As part of our work on the AST vs. Bytecode Interpreters paper, I briefly looked at what the native code of ahead-of-time-compiled bytecode loops looks like, but except for finding much more code than what I expected, I didn’t look too closely at what was going on.

After the paper was completed, I went on to make big plans, falling into the old trap of simply assuming that I roughly knew where the interpreters spend their time, and based on these assumptions developed grand ideas to make them faster. None of these ideas would be easy of course, requiring possibly months of work. Talking to my colleagues working on the Graal compiler, I was very kindly reminded that I should know and not guess where the execution time goes. I remember hearing that before, probably when I told my students the same thing…

So, here we are. This blog post has two goals: document how to build the various Truffle interpreters and how to use VTune for future me, and discuss a bit the findings of where Truffle bytecode interpreters spend much of their time.

To avoid focusing on implementation-specific aspects, I’ll look at four different Truffle-based bytecode interpreters. I’ll look at my own trusty TruffleSOM, Espresso (a JVM on Truffle), GraalPy, and GraalWasm.

To keep things brief, below I’ll just give the currently working magic incantations that produce ahead-of-time-compiled binaries for the bytecode interpreters, as well as how to run them as pure interpreters using the classic Mandelbrot benchmark.

Building the Interpreters

Let’s start with TruffleSOM. The following is the command line to build all dependencies and the bytecode interpreter. It also compiles it in a way that just-in-time compilation is disabled, something which the other implementations take as a command-line flag. The result is a binary in the same folder.

TruffleSOM$ mx build-native --build-native-image-tool --build-trufflesom --no-jit --type BC

Espresso is part of the Graal repository and all necessary build settings are conveniently maintained as part of the repository. To find the folder with the binary, we can use the second command:

espresso$ mx --env native-ce build
espresso$ mx --env native-ce graalvm-home

GraalPy is in its own repository, though can also be built similarly. It also prints conveniently the path where we find the result.

graalpy$ mx python-svm

Last but not least, GraalWasm takes a little more convincing to get the same result. Here the configuration isn’t included in the repository.

wasm$ export DYNAMIC_IMPORTS=/substratevm,/sdk,/truffle,/compiler,/wasm,/tools
wasm$ export COMPONENTS=cmp,cov,dap,gvm,gwa,ins,insight,insightheap,lg,lsp,pro,sdk,sdkl,tfl,tfla,tflc,tflm
wasm$ export NATIVE_IMAGES=lib:jvmcicompiler,lib:wasmvm
wasm$ export NON_REBUILDABLE_IMAGES=lib:jvmcicompiler
wasm$ export DISABLE_INSTALLABLES=False
wasm$ mx build
wasm$ mx graalvm-home

At this point, we have our four interpreters ready for use.

Building the Benchmarks

As a benchmark, I’ll use the Are We Fast Yet version of Mandelbrot. This is mostly for my own convenience. For this experiment we just need a benchmark that mostly runs inside the bytecode loop, and Mandelbrot will do a good-enough job with that.

TruffleSOM and Python take care of compilation implicitly, but for Java and Wasm, we need to produce the jar and wasm files ourselves. For Wasm, I used the C++ version of the benchmark and Emscripten to compile it.

Java$  ant jar   # creates a benchmarks.jar
C++$   CXX=em++ OPT='-O3 -sSTANDALONE_WASM' build.sh

The -sSTANDALONE_WASM flag makes sure Emscripten gives us a wasm module that works without further issues on GraalWasm.

Running the Benchmarks

Executing the Mandelbrot benchmark is now relatively straightforward. Though, I’ll skip over the full path details below. For Espresso, GraalPy, and GraalWasm, we use the command-line flags to disable just-in-time compilation as follows.

Note the executables are the ones we built above. Running for instance java --version should show something like the following:

openjdk 21.0.2 2024-01-16
OpenJDK Runtime Environment GraalVM CE 21.0.2-dev+13.1 (build 21.0.2+13-jvmci-23.1-b33)
Espresso 64-Bit VM GraalVM CE 21.0.2-dev+13.1 (build 21-espresso-24.1.0-dev, mixed mode)

With this, running the benchmarks uses roughly the following commands:

som-native-interp-bc -cp Smalltalk Harness.som Mandelbrot 10 500
java    --experimental-options --engine.Compilation=false -cp benchmarks.jar Harness Mandelbrot 10 500
graalpy --experimental-options --engine.Compilation=false ./harness.py Mandelbrot 10 500
wasm    --experimental-options --engine.Compilation=false ./harness-em++-O3-sSTANDALONE_WASM Mandelbrot 10 500
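For scripting repeated runs, the four command lines above can be assembled programmatically. This is only a sketch: the function and parameter names are hypothetical, the interpretation of the two trailing numbers as iteration counts is my assumption, and the binary names are the shortened ones from above, without their full paths.

```python
import shlex


def benchmark_commands(benchmark="Mandelbrot", num_iterations=10, inner_iterations=500):
    """Build the argument lists for the four interpreters, with JIT compilation disabled."""
    # Flags used above for Espresso, GraalPy, and GraalWasm;
    # TruffleSOM was already built with --no-jit.
    no_jit = ["--experimental-options", "--engine.Compilation=false"]
    args = [benchmark, str(num_iterations), str(inner_iterations)]
    return {
        "TruffleSOM": ["som-native-interp-bc", "-cp", "Smalltalk", "Harness.som"] + args,
        "Espresso": ["java"] + no_jit + ["-cp", "benchmarks.jar", "Harness"] + args,
        "GraalPy": ["graalpy"] + no_jit + ["./harness.py"] + args,
        "GraalWasm": ["wasm"] + no_jit + ["./harness-em++-O3-sSTANDALONE_WASM"] + args,
    }


for name, argv in benchmark_commands().items():
    print(f"{name}: {shlex.join(argv)}")
```

Each list can be passed as-is to `subprocess.run` once the executables are on the path.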

Using VTune

There are various useful profilers out there, though, my colleagues specifically asked me to have a look at VTune, and I figured, it might be a convenient way to grab various hardware details from an Intel CPU.

However, I do not have direct access to an Intel workstation. So, instead of using the VTune desktop user interface or command line, I’ll actually use the VTune server on one of our benchmarking machines. This was surprisingly convenient and seems useful for rerunning previous experiments with different settings or binaries.

The machine is suitably protected, but I can’t recommend using the following in an open environment:

vtune-server --web-port $PORT --enable-server-profiling --reset-passphrase

This prints the URL where the web interface can be accessed, and is configured so that we can run experiments directly from the interface, which helps with finding the various interesting options.

For all four interpreters, I’ll focus on what VTune calls the Hotspots profiling runs. I used the Hardware Event-Based Sampling setting with additional performance insights.

After it finished running the benchmark, VTune opens a Summary of the results with various statistics. Though, for this investigation most interesting is the Bottom-up view of where the program spent its time. For all four interpreters, the top function is the bytecode loop.

Opening the top function allows us to view the assembly, and group by Basic Block / Address. This neatly adds up the time of each instruction in a basic block, and gives us an impression of how much time we spent in each block.

The Bytecode Dispatch

VTune gives us a convenient way to identify which parts of the compiled code are executed and how much time we spend in them. What surprised me is that about 50% of all time is spent in bytecode dispatch. Not the bytecode operation itself, no, but the code executed for every single bytecode leading up to and including the dispatch.

Below is the full code of the “bytecode dispatch” for GraalWasm. As far as I can see, all four interpreters have roughly the same native code structure. It starts with a very long sequence of instructions that likely read various bits out of Truffle’s VirtualFrame objects, and then proceeds to do the actual dispatch via what the Graal compiler calls an IntegerSwitchNode in its intermediate representation, for which the RangeTableSwitchOp strategy is used for compilation. This encodes the bytecode dispatch by looking up the jump target in a table, and then performing the jmp %rdx instruction at the very end of the code below.

Address	Assembly	CPU Time
	Block 18	11.963s
`0x1f5b229`	`xorpd %xmm0, %xmm0`	0.306s
`0x1f5b22d`	`mov $0xfffffffffffff, %rdx`	0.486s
`0x1f5b237`	`movq %rdx, 0x1f0(%rsp)`	0.080s
`0x1f5b23f`	`mov $0xeae450, %rdx`	0.070s
`0x1f5b249`	`mov %r8d, %ebx`	0.346s
`0x1f5b24c`	`movzxb 0x10(%r10,%rbx,1), %ebx`	0.391s
`0x1f5b252`	`movq 0x18(%rdi), %rbp`	0.050s
`0x1f5b256`	`movq %rbp, 0xd0(%rsp)`	0.030s
`0x1f5b25e`	`movq 0x10(%rdi), %rbp`	0.256s
`0x1f5b262`	`movq %rbp, 0xc8(%rsp)`	0.441s
`0x1f5b26a`	`lea (%r14,%rbp,1), %rsi`	0.256s
`0x1f5b26e`	`movq %rsi, 0xc0(%rsp)`	0.020s
`0x1f5b276`	`lea 0xe(%r8), %esi`	0.611s
`0x1f5b27a`	`lea 0xd(%r8), %ebp`	0.356s
`0x1f5b27e`	`lea 0xc(%r8), %edi`	0.135s
`0x1f5b282`	`lea 0xb(%r8), %r11d`	0.055s
`0x1f5b286`	`lea 0xa(%r8), %r13d`	0.306s
`0x1f5b28a`	`movl %r13d, 0x1ec(%rsp)`	0.401s
`0x1f5b292`	`lea 0x9(%r8), %r12d`	0.125s
`0x1f5b296`	`movl %r12d, 0x1e8(%rsp)`	0.065s
`0x1f5b29e`	`lea 0x8(%r8), %edx`	0.326s
`0x1f5b2a2`	`lea 0x7(%r8), %ecx`	0.416s
`0x1f5b2a6`	`movl %esi, 0x1e4(%rsp)`	0.115s
`0x1f5b2ad`	`lea 0x6(%r8), %esi`	0.030s
`0x1f5b2b1`	`movl %ebp, 0x1e0(%rsp)`	0.311s
`0x1f5b2b8`	`lea 0x5(%r8), %ebp`	0.356s
`0x1f5b2bc`	`movl %edi, 0x1dc(%rsp)`	0.150s
`0x1f5b2c3`	`lea 0x4(%r8), %edi`	0.040s
`0x1f5b2c7`	`movl %r11d, 0x1d8(%rsp)`	0.321s
`0x1f5b2cf`	`lea 0x3(%r8), %r11d`	0.416s
`0x1f5b2d3`	`movl %r11d, 0x1d4(%rsp)`	0.135s
`0x1f5b2db`	`lea 0x2(%r8), %r13d`	0.065s
`0x1f5b2df`	`movl %r13d, 0x1d0(%rsp)`	0.276s
`0x1f5b2e7`	`mov %r8d, %r12d`	0.426s
`0x1f5b2ea`	`inc %r12d`	0.125s
`0x1f5b2ed`	`movl %r12d, 0x1cc(%rsp)`	0.045s
`0x1f5b2f5`	`movl %edx, 0x1c8(%rsp)`	0.321s
`0x1f5b2fc`	`mov %r9d, %edx`	0.441s
`0x1f5b2ff`	`inc %edx`	0.120s
`0x1f5b301`	`movl %edx, 0x1c4(%rsp)`	0.040s
`0x1f5b308`	`lea -0x3(%r9), %edx`	0.366s
`0x1f5b30c`	`movl %edx, 0x1c0(%rsp)`	0.311s
`0x1f5b313`	`lea -0x2(%r9), %edx`	0.221s
`0x1f5b317`	`movl %edx, 0x1bc(%rsp)`	0.045s
`0x1f5b31e`	`lea 0xf(%r8), %edx`	0.376s
`0x1f5b322`	`mov %r9d, %r8d`	0.366s
`0x1f5b325`	`dec %r8d`	0.085s
`0x1f5b328`	`movl %r8d, 0x1b8(%rsp)`	0.045s
`0x1f5b330`	`movl %edx, 0x1b4(%rsp)`	0.371s
`0x1f5b337`	`mov %ebx, %r9d`	0.481s
`0x1f5b33a`	`cmp $0xfe, %r9d`	0.035s
`0x1f5b341`	`jnbe 0x1f7c658`

	Block 19	0.982s
`0x1f5b347`	`lea 0xa(%rip), %rdx`	0.035s
`0x1f5b34e`	`movsxdl (%rdx,%r9,4), %r9`	0.321s
`0x1f5b352`	`add %r9, %rdx`	0.506s
`0x1f5b355`	`jmp %rdx`	0.120s

Someone who writes bytecode interpreters directly in assembly might be mortified by this code. Though, to me this is more of an artifact of some missed optimization opportunities in the otherwise excellent Graal compiler, which hopefully can be fixed.

I won’t include the results for the other interpreters here, but to summarize, let’s count the instructions of the bytecode dispatch for each of them:

TruffleSOM: 31 instructions in 1 basic block (after some extra optimization)
Espresso: 79 instructions in 2 basic blocks
GraalPy: 81 instructions in 3 basic blocks
GraalWasm: 56 instructions in 2 basic blocks

For GraalPy it is even a little more complex. There are various other basic blocks involved and none of the bytecode handlers jump back directly to the same block. Instead there seems to be some more code after each handler before they jump back to the top of the loop.

The First Micro-Optimization Opportunity

As mentioned for TruffleSOM, I did already look into one optimization opportunity. The very careful reader might have noticed the end of block 18 above.

Address	Assembly	CPU Time
`0x1f5b33a`	`cmp $0xfe, %r9d`	0.035s
`0x1f5b341`	`jnbe 0x1f7c658`

This is a correctness check for the switch/case statement in the Java code. It makes sure that the value we switch over is covered by the cases in the switch. Otherwise, we’re jumping to some default block, or just back to the top of the loop.
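To illustrate the structure at a higher level, here is a toy interpreter loop with a range check followed by a table-based dispatch. It is a sketch with a made-up three-instruction bytecode set, not how Truffle interpreters or Graal implement it; the Python list of handlers merely stands in for the jump table, and the explicit bounds test stands in for the `cmp`/`jnbe` pair.

```python
# Hypothetical three-instruction bytecode set, for illustration only.
PUSH1, ADD, HALT = 0, 1, 2


def op_push1(stack):
    stack.append(1)
    return True  # keep running


def op_add(stack):
    right = stack.pop()
    left = stack.pop()
    stack.append(left + right)
    return True


def op_halt(stack):
    return False  # stop the loop


# The "jump table": the opcode is the index of its handler.
HANDLERS = [op_push1, op_add, op_halt]


def interpret(bytecodes):
    stack = []
    pc = 0
    while True:
        opcode = bytecodes[pc]
        pc += 1
        # Range check: opcodes outside the table take the default path.
        if not 0 <= opcode < len(HANDLERS):
            raise ValueError(f"unknown bytecode {opcode}")
        # Table lookup plus indirect call, standing in for the indirect jump.
        if not HANDLERS[opcode](stack):
            return stack


print(interpret([PUSH1, PUSH1, ADD, HALT]))
```

If every bytecode were guaranteed to be a valid table index, the range check could be dropped without changing the behavior for valid programs, which is the idea behind the micro-optimization.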

The implementation in Graal is a little bit too hard-coded for my taste. While it has all the logic to eliminate unreachable cases, for instance when it sees that the value we switch over is guaranteed to exclude some cases, it does handle the default case directly in the machine-specific lowering code.

Seems like one could generalize this a little more and possibly handle the default case like the other cases. The relevant Graal issue for this micro-optimization is #8425. However, when I applied this to TruffleSOM’s version of Graal, and eliminated those two instructions, it didn’t make a difference. The remaining 31 instructions still dominate the bytecode dispatch.

Conclusion

The most important take-away message here is of course: know, don’t assume. Or more specifically: measure, don’t guess.

For the problem at hand, it looks like Graal struggles with hoisting some common reads out of the bytecode loops. If there’s a way to fix this, this could give a massive speedup to all Truffle-based bytecode interpreters, perhaps enough to invalidate our AST vs. Bytecode Interpreters paper. Wouldn’t that be fun?! 🤓 🧑🏻‍🔬

The mentioned micro optimization would also avoid a few instructions for every switch/case in normal Java code, when it doesn’t need the default case. So, it might be relevant for more than just bytecode interpreters.


Addendum: Dispatch Code for PySOM’s RPython-based Bytecode Interpreter

After turning my notes into this blog post yesterday, I figured today I should also look at what RPython is doing for my PySOM bytecode interpreter.

The below 13 instructions are the bytecode dispatch for the interpreter. While it is much shorter, it also contains the safety check cmp $0x45, %al to make sure the bytecode is within the set of targets. Ideally, a bytecode verifier would have ensured that already, or perhaps we set up the native code so that there are simply no unsafe jump targets to avoid having to check every time, which at least based on VTune seems to consume a considerable amount of the overall run time. ~~Also somewhat concerning is that the check is done twice. Block 21 already has a cmp $0x45, %rax, which should make the second test unnecessary.~~ (Correction: the second check was unrelated, and I managed to remove it by applying an optimization I had already on TruffleSOM, but not yet in PySOM.)

So, yeah, I guess PySOM could be a bit faster on every single bytecode, which might mean PyPy could possibly also improve its interpreted performance.

Address	Assembly	CPU Time
	Block 20	0.085s
`0xb45d8`	`cmp %r14, %rcx`	0.085s
`0xb45db`	`jle 0xb6ba0`
	Block 21	0.631s
`0xb45e1`	`movzxb 0x10(%rdx,%r14,1), %eax`	0.311s
`0xb45e7`	`cmp $0x45, %rax`	0.321s
`0xb45eb`	`jnle 0xb6c00`
	Block 22	4.601s
`0xb45f1`	`lea 0x69868(%rip), %rdi`	0.571s
`0xb45f8`	`movq 0x10(%rdi,%rax,8), %r15`	0.020s
`0xb45fd`	`add %r14, %r15`	2.942s
`0xb4600`	`cmp $0x45, %al`	1.068s
`0xb4602`	`jnbe 0xb6c59`
	Block 23	0.476s
`0xb4608`	`movsxdl (%rbx,%rax,4), %rax`	0s
`0xb460c`	`add %rbx, %rax`	0.015s
`0xb460f`	`jmp %rax`	0.461s
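The per-block times that VTune reports are the sums of the per-instruction times. As a sketch, the grouping can be reproduced for the PySOM table above; the values are copied from the table, instructions without a reported time are counted as 0, and small differences to the reported block totals (for instance for Block 21) come from rounding in the displayed values.

```python
from collections import defaultdict

# (basic block, CPU time in seconds) per instruction, copied from the table above.
INSTRUCTION_TIMES = [
    (20, 0.085), (20, 0.0),
    (21, 0.311), (21, 0.321), (21, 0.0),
    (22, 0.571), (22, 0.020), (22, 2.942), (22, 1.068), (22, 0.0),
    (23, 0.0), (23, 0.015), (23, 0.461),
]


def time_per_block(instruction_times):
    """Sum the CPU time of all instructions that belong to the same basic block."""
    totals = defaultdict(float)
    for block, seconds in instruction_times:
        totals[block] += seconds
    return dict(totals)


totals = time_per_block(INSTRUCTION_TIMES)
dispatch_total = sum(totals.values())

for block, seconds in sorted(totals.items()):
    print(f"Block {block}: {seconds:.3f}s")
print(f"Dispatch total: {dispatch_total:.3f}s")
```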

Rank 10 Language Implementations

2024-02-13T22:51:23+00:00

Please rank 10 language implementations by their median performance, based on your best guess or estimate.

The above plot is based on the performance of 9 microbenchmarks and 5 slightly larger benchmarks, which were designed to study the effectiveness of compilers.

Use your best guess:

My goal with this poll is to see what the general performance expectations for various language implementations are.


Thanks for playing!

The Changing “Guarantees” Given by Python’s Global Interpreter Lock

2023-11-17T10:47:22+00:00

In this blog post, I will look into the implementation details of CPython’s Global Interpreter Lock (GIL) and how they changed between Python 3.9 and the current development branch that will become Python 3.13.

My goal is to understand which concrete “guarantees” the GIL gives in both versions, which “guarantees” it does not give, and which ones one might assume based on testing and observation. I am putting “guarantees” in quotes, because with a future no-GIL Python, none of the discussed properties should be considered language guarantees.

While Python has various implementations, including CPython, PyPy, Jython, IronPython, and GraalPy, I’ll focus on CPython as the most widely used implementation. PyPy and GraalPy also use a GIL, though their implementations subtly differ from CPython’s, as we will see a little later.

1. What Is the GIL?

Let’s recap a bit of background. When CPython started to support multiple operating system threads, it became necessary to protect various CPython-internal data structures from concurrent access. Instead of adding locks or using atomic operations to protect the correctness of for instance reference counting, the content of lists, dictionaries, or internal data structures, the CPython developers decided to take a simpler approach and use a single global lock, the GIL, to protect all of these data structures from incorrect concurrent accesses. As a result, one can start multiple threads in CPython, though only one of them runs Python bytecode at any given time.

The main benefit of this approach is its simplicity and single-threaded performance. Because there’s only a single lock to worry about, it’s easy to get the implementation correct without risking deadlocks or other subtle concurrency bugs at the level of the CPython interpreter. Thus, the GIL represented a suitable point in the engineering trade-off space between correctness and performance.

2. Why Does the Python Community Think About Removing the GIL?

Of course, the obvious downside of this design is that only a single thread can execute Python bytecode at any given time. I am talking about Python bytecode here again, because operations that may take a long time, for instance reading a file into memory, can release the GIL and allow other threads to run in parallel.

For programs that spend most of their time executing Python code, the GIL is of course a huge performance bottleneck, and thus, PEP 703 proposes to make the GIL optional. The PEP mentions various use cases, including machine learning, data science, and other numerical applications.

3. Which “Guarantees” Does the GIL Provide?

So far, I only mentioned that the GIL is there to protect CPython’s internal data structures from concurrent accesses to ensure correctness. However, when writing Python code, I am more interested in the “correctness guarantees” the GIL gives me for the concurrent code that I write. To know these “correctness guarantees”, we need to delve into the implementation details of when the GIL is acquired and released.

The general approach is that a Python thread obtains the GIL when it starts executing Python bytecode. It will hold the GIL as long as it needs to and eventually release it, for instance when it is done executing, or when it is executing some operation that often would be long-running and itself does not require the GIL for correctness. This includes for instance the aforementioned file reading operation or more generally any I/O operation. However, a thread may also release the GIL when executing specific bytecodes.

This is where Python 3.9 and 3.13 differ substantially. Let’s start with Python 3.13, which I think roughly corresponds to what Python has been doing since version 3.10 (roughly since this PR). Here, the most relevant bytecodes are for function or method calls as well as bytecodes that jump back to the top of a loop or function. Thus, only a few bytecodes check whether there was a request to release the GIL.

In contrast, in Python 3.9 and earlier versions, the GIL is released at least in some situations by almost all bytecodes. Only a small set of bytecodes including stack operations, LOAD_FAST, LOAD_CONST, STORE_FAST, UNARY_POSITIVE, IS_OP, CONTAINS_OP, and JUMP_FORWARD do not check whether the GIL should be released.

These bytecodes all use the CHECK_EVAL_BREAKER() on 3.13 (src) or DISPATCH() on 3.9 (src), which eventually checks (3.13, 3.9) whether another thread requested the GIL to be released by setting the GIL_DROP_REQUEST bit in the interpreter’s state.

What makes “atomicity guarantees” more complicated to reason about is that this bit is set by threads waiting for the GIL based on a timeout (src). The timeout is specified by sys.setswitchinterval().
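As a small sketch of how the interval can be inspected and changed at run time, the snippet below uses `sys.getswitchinterval()` and `sys.setswitchinterval()`. The value of 1 ms is only an illustrative choice of mine, not a recommendation.

```python
import sys

# Remember the current interval (CPython documents a default of 5 ms).
original = sys.getswitchinterval()

# Ask waiting threads to request the GIL after roughly 1 ms instead.
sys.setswitchinterval(0.001)
shortened = sys.getswitchinterval()

# Restore the previous value so the rest of the program is unaffected.
sys.setswitchinterval(original)
restored = sys.getswitchinterval()

print(f"original={original}, shortened={shortened}, restored={restored}")
```

A shorter interval makes thread switches more frequent, which tends to make non-atomic sequences easier to observe, at the cost of more switching overhead.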

In practice, what does this mean?

For Python 3.13, this should mean that a function that contains only bytecodes that do not lead to a CHECK_EVAL_BREAKER() check should be atomic.

For Python 3.9, this means a very small set of bytecode sequences can be atomic, though, except for a tiny set of specific cases, one can assume that a bytecode sequence is not atomic.
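To get a feeling for which bytecodes a piece of code is compiled to, one can use the `dis` module from the standard library. The following is a minimal sketch. The `Counter` class and its `increment` method are hypothetical names for illustration, and the exact bytecode names depend on the CPython version.

```python
import dis


class Counter:
    def __init__(self):
        self.count = 0

    def increment(self):
        # A single source line, but several bytecodes.
        self.count += 1


# Collect the names of the bytecodes the method is compiled to.
opnames = [instr.opname for instr in dis.get_instructions(Counter.increment)]
print(opnames)
```

Even this one-line increment is split into a separate attribute load and attribute store, with further bytecodes in between. Whether a thread switch can happen between them then depends on which of these bytecodes check for a GIL release request in the given Python version.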

However, since the Python community is taking steps that may lead to the removal of the GIL, the changes in recent Python versions to give much stronger atomicity “guarantees” are likely a step in the wrong direction for the correctness of concurrent Python code. I mean this in the sense that people may accidentally rely on these implementation details, leading to hard-to-find concurrency bugs when running on a no-GIL Python.

4. Which Guarantees Might One Incorrectly Assume the GIL Provides?

Thanks to @cfbolz, I have at least one very concrete example of code that someone assumed to be atomic:

request_id = self._next_id
self._next_id += 1

The code tries to solve a classic problem: we want to hand out unique request ids. However, it breaks when multiple threads execute this code at the same time, or rather interleaved with each other, because then we end up getting the same id multiple times. This concrete bug was fixed by making the reading and incrementing atomic using a lock.
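A lock-based version could look roughly like the sketch below. This is not the code of the actual fix. The names `RequestIds` and `collect_ids` as well as the thread and iteration counts are my own assumptions for illustration.

```python
import threading


class RequestIds:
    def __init__(self):
        self._next_id = 0
        self._lock = threading.Lock()

    def next_id(self):
        # Reading and incrementing happen while holding the lock, so no
        # other thread can read or modify _next_id in between.
        with self._lock:
            request_id = self._next_id
            self._next_id += 1
        return request_id


def collect_ids(num_threads=4, ids_per_thread=10_000):
    ids = RequestIds()
    # One private result list per thread, so only next_id() is shared.
    results = [[] for _ in range(num_threads)]

    def worker(bucket):
        for _ in range(ids_per_thread):
            bucket.append(ids.next_id())

    threads = [threading.Thread(target=worker, args=(r,)) for r in results]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [i for bucket in results for i in bucket]


all_ids = collect_ids()
print(len(all_ids), len(set(all_ids)))
```

With the lock in place, the number of handed-out ids and the number of distinct ids are the same, independent of when the interpreter decides to switch threads.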

On Python 3.9, we can relatively easily demonstrate the issue:

def get_id(self):
    # expected to be atomic: start
    request_id = self.id
    self.id += 1
    # expected to be atomic: end

    self.usage[request_id % 1_000] += 1

Running this on multiple threads will allow us to observe an inconsistent number of usage counts. They should be all the same, but they are not. Arguably, it’s not clear whether the observed atomicity issue is from the request_id or the usage counts, but the underlying issue is the same in both cases. For the full example see 999_example_bug.py.

This repository contains a number of other examples that demonstrate the difference between different Python implementations and versions.

Generally, on Python 3.13 most bytecode sequences without function calls will be atomic. On Python 3.9, far fewer are, and I believe that would be the better choice, because it keeps people from writing code that relies on the very strong guarantees that Python 3.13 gives.

As mentioned earlier, because the GIL is released based on a timeout, one may also perceive bytecode sequences as atomic when experimenting.

Let’s assume we run the following two functions on threads in parallel:

def insert_fn(list):
    for i in range(100_000_000):
        list.append(1)
        list.append(1)
        list.pop()
        list.pop()
    return True


def size_fn(list):
    for i in range(100_000_000):
        l = len(list)
        assert l % 2 == 0, f"List length was {l} at attempt {i}"
    return True

Depending on how fast the machine is, it may take 10,000 or more iterations of the loop in size_fn before we see the length of the list to be odd. This means it takes 10,000 iterations before the function calls to append or pop allowed the GIL to be released before the second append(1) or after the first pop().

Without looking at the CPython source code, one might have concluded easily that these bytecode sequences are atomic.

Though, there’s a way to make it visible earlier. By setting the thread switch interval to a very small value, for instance with sys.setswitchinterval(0.000000000001), one can observe an odd list length after only a few or few hundred iterations of the loop.
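A bounded variant of this experiment can be sketched as follows. The function name `find_odd_length`, the iteration count, and the switch interval of one microsecond are assumptions I picked to keep the run short. Whether an odd length is observed at all depends on the machine and the Python version.

```python
import sys
import threading


def find_odd_length(iterations=100_000, switch_interval=1e-6):
    """Return the first odd list length the reader sees, or None."""
    shared = []
    observed = []
    done = threading.Event()

    def writer():
        for _ in range(iterations):
            shared.append(1)
            shared.append(1)
            shared.pop()
            shared.pop()
        done.set()

    def reader():
        while not done.is_set():
            n = len(shared)
            if n % 2 != 0:
                observed.append(n)
                return

    previous = sys.getswitchinterval()
    sys.setswitchinterval(switch_interval)
    try:
        threads = [threading.Thread(target=writer),
                   threading.Thread(target=reader)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    finally:
        # Always restore the interval, even if a thread failed to start.
        sys.setswitchinterval(previous)
    return observed[0] if observed else None


print(find_odd_length())
```

Since the writer only ever has zero, one, or two elements in the list, the only odd length that can show up is 1. A result of `None` merely means that no switch happened at an inconvenient point during this particular run, not that the sequence is atomic.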

5. Comparing Observable GIL Behavior for Different CPython Versions, PyPy, and GraalPy

In my gil-sem-demos repository, I have a number of examples that try to demonstrate observable differences in GIL behavior.

Of course, the very first example tries to show the performance benefit of running multiple Python threads in parallel. Using the no-GIL implementation, one indeed sees the expected parallel speedup.

On the other tests, we see the major differences between Python 3.8 - 3.9 and the later 3.10 - 3.13 versions. The latter versions usually execute the examples without seeing results that show a bytecode-level atomicity granularity. Instead, they suggest that loop bodies without function calls are pretty much atomic.

For PyPy and GraalPy, it is also harder to observe the bytecode-level atomicity granularity, because they are simply faster. Lowering the switch interval makes it a little more observable, except for GraalPy, which likely aggressively removes the checks for whether to release the GIL.

Another detail for the no-GIL implementation: it crashes for our earlier bug example. It complains about *** stack smashing detected ***.

A full log is available as a gist.

6. Conclusion

In this blog post, I looked into the implementation details of CPython’s Global Interpreter Lock (GIL). The semantics between Python 3.9 and 3.13 differ substantially. Python 3.13 gives much stronger atomicity “guarantees”, releasing the GIL basically only on function calls and jumps back to the top of a loop or function.

If the Python community intends to remove the GIL, this seems problematic. I would expect more people to implicitly rely on these much stronger guarantees, whether consciously or not.

My guess would be that this change was done mostly in an effort to improve the single-threaded performance of CPython.

To enable people to test their code on these versions closer to semantics that match a no-GIL implementation, I would suggest adding a compile-time option to CPython that forces a GIL release and thread switch after bytecodes that may trigger behavior visible to other threads. This way, people would have a chance to test on a stable system that is closer to the future no-GIL semantics and probably only minimally slower at executing unit tests.

Any comments, suggestions, or questions are also greatly appreciated perhaps on Twitter @smarr or Mastodon.

Which Interpreters are Faster, AST or Bytecode?

2023-10-16T12:44:35+01:00

This post is a brief overview of our new study of abstract-syntax-tree and bytecode interpreters on top of RPython and the GraalVM metacompilation systems, which we are presenting next week at OOPSLA.

Prelude: Why Did I Build a Language on Top of the Truffle and RPython Metacompilation Systems?

Writing this post, I realized that I have been working on interpreters on top of metacompilation systems for 10 years now. And in much of that time, it felt to me that widely held beliefs about interpreters did not quite match my own experience.

In summer 2013, I started to explore new directions for my research after finishing my PhD just in January that year. The one question that was still bugging me back then was how I could show that my PhD ideas of using metaobject protocols to realize different concurrency models were not just possible, but perhaps even practical.

During my PhD, I worked with a bytecode interpreter written in C++ and now had the choice to spend the next few years writing a just-in-time compiler for this thing, which was never going to be able to keep up with what state-of-the-art VMs offer, or finding another way to get good performance.

To my surprise, there was another way. Though even today, some say that metacompilation is ludicrous, and will never be practical. And others simply use PyPy as their secret sauce and enjoy better Python performance…

Though, I didn’t start with PyPy, or rather its RPython metacompilation framework. Instead, I found the Truffle framework with its Graal just-in-time compiler and implemented the Simple Object Machine (SOM) on top of it, a dynamic language I had been using for a while already. Back then, Truffle and Graal were very new, and it took a while to get my TruffleSOM to reach the desired “state-of-the-art performance”, but I got there eventually. On the way to reaching that point, I also implemented PySOM on top of PyPy’s RPython framework. This was my way to get a better understanding of metacompilation more generally. But enough of history…

The Promise of Metacompilation: State-of-the-Art Performance for Little Effort

If you’re Google, you can afford to finance your own JavaScript virtual machine with a state-of-the-art interpreter, state-of-the-art garbage collector, and no less than three different just-in-time compilers, which of course are also “state of the art”. Your average academic, and even large language communities such as those around Ruby, Python, and PHP, do not have the resources to build such VMs.¹

¹ Of course, reality is more complicated, but I’ll skip over it.

This is where metacompilation comes in and promises us that we can reuse existing compilers, garbage collectors, and all the other parts of an existing high-level language VM. All we need to do is implement our language as an interpreter on top of something like the GraalVM or RPython systems.

That’s the promise. And, for a certain set of benchmarks, and a certain set of use cases, these systems deliver exactly that: reuse of these components. We still have to implement an interpreter suitable for these systems though. While this is no small feat, it’s something “an average” academic can do with enough time and stubbornness, and there are plenty of examples using Truffle and RPython.

And my SOM implementation indeed manages to hold its own compared to Google’s V8:

Figure 1. Just-in-time-compiled peak performance of the Are We Fast Yet benchmarks, shown as an aggregate over all benchmarks on a logarithmic scale, with Java 17 (HotSpot) as the baseline. TSOM_AST reaches the performance of Node.js, while the peak performance of the other implementations is a little further behind.

As we can see here, both V8 inside of Node.js and TSOM_AST, which is short for the abstract-syntax-tree-based TruffleSOM interpreter, are roughly in the same range of being 1.7× to 2.7× slower than the HotSpot JVM. With SOM being a dynamic language similar to JavaScript, and easily within a range of ±50% of V8’s performance, I’d argue that metacompilation indeed lives up to its promise.

Of course, the used Are We Fast Yet benchmarks test only a relatively small common part of these languages, but they show that the core language elements common to Java, JavaScript, and other object-oriented languages reach roughly the same level of performance.

Interpreters for Metacompilation Systems

However, we wanted to talk about abstract-syntax-tree (AST) and bytecode interpreters. The goal of our work was to investigate the difference between these two approaches to build interpreters on top of GraalVM and RPython.
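To make the difference between the two styles concrete, here is a deliberately tiny sketch of both for arithmetic expressions. It is plain Python for illustration only and not taken from TruffleSOM or PySOM. The node classes, the opcodes `PUSH`, `ADD`, `MUL`, and the function `run_bytecode` are all made up for this example.

```python
from dataclasses import dataclass


# --- AST interpreter: every node knows how to execute itself. ---
@dataclass
class Literal:
    value: int

    def execute(self):
        return self.value


@dataclass
class Add:
    left: object
    right: object

    def execute(self):
        return self.left.execute() + self.right.execute()


@dataclass
class Mul:
    left: object
    right: object

    def execute(self):
        return self.left.execute() * self.right.execute()


# --- Bytecode interpreter: one loop dispatching over a flat list. ---
PUSH, ADD, MUL = range(3)


def run_bytecode(code):
    stack = []
    pc = 0
    while pc < len(code):
        op = code[pc]
        if op == PUSH:
            # PUSH carries its operand in the next slot.
            stack.append(code[pc + 1])
            pc += 2
        elif op == ADD:
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
            pc += 1
        elif op == MUL:
            right = stack.pop()
            left = stack.pop()
            stack.append(left * right)
            pc += 1
        else:
            raise ValueError(f"unknown opcode {op}")
    return stack.pop()


# Both programs encode (1 + 2) * 3.
ast_program = Mul(Add(Literal(1), Literal(2)), Literal(3))
bytecode_program = [PUSH, 1, PUSH, 2, ADD, PUSH, 3, MUL]

ast_result = ast_program.execute()
bytecode_result = run_bytecode(bytecode_program)
print(ast_result, bytecode_result)
```

The AST version keeps the program as a tree of objects and uses the host language's call stack, while the bytecode version keeps a compact, flat encoding and manages an explicit operand stack in a dispatch loop.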

To this end, we had to implement no less than four interpreters, two AST interpreters and two bytecode interpreters, one each on the two different systems. We also had to optimize them roughly to the same level, so that we can draw conclusions from the comparison. While AST and bytecode interpreters naturally lend themselves to different optimizations, we didn’t stop at what’s commonly done. Instead, we implemented the classic optimizations to gain the key performance benefits, and once we hit diminishing returns, we added the optimizations from the AST interpreters to the bytecode ones or the other way around, so that they roughly implement the same set of optimizations. Currently, this includes (see Section 4 of the paper for more details):

polymorphic lookup/inline caching
inlining of control structures and anonymous functions in specific situations
superinstructions for bytecode interpreters and supernodes for AST interpreters
bytecode quickening and self-optimizing AST nodes
lowering/intrinsifying of basic standard library methods
caching of globals
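As a rough illustration of the first item, the idea behind polymorphic inline caching can be sketched in plain Python as below. This is only meant to convey the concept and is not how TruffleSOM or PySOM implement it. `CallSite`, its `max_entries` limit, and the `Circle` and `Square` classes are hypothetical.

```python
class CallSite:
    """A send site that caches method lookups per receiver class."""

    def __init__(self, selector, max_entries=4):
        self.selector = selector
        self.max_entries = max_entries
        self.cache = {}        # receiver class -> looked-up function
        self.slow_lookups = 0

    def lookup(self, cls):
        # Stand-in for an expensive, generic method lookup.
        self.slow_lookups += 1
        return getattr(cls, self.selector)

    def send(self, receiver, *args):
        cls = type(receiver)
        method = self.cache.get(cls)
        if method is None:
            method = self.lookup(cls)
            # Stop adding entries once the site has seen too many classes.
            if len(self.cache) < self.max_entries:
                self.cache[cls] = method
        return method(receiver, *args)


class Circle:
    def __init__(self, r):
        self.r = r

    def area(self):
        return 3 * self.r * self.r  # crude approximation to stay with ints


class Square:
    def __init__(self, s):
        self.s = s

    def area(self):
        return self.s * self.s


site = CallSite("area")
shapes = [Circle(1), Square(2), Circle(3), Square(4)]
areas = [site.send(shape) for shape in shapes]
print(areas, site.slow_lookups)
```

The send site sees two receiver classes and performs the slow lookup only once for each of them. All later sends with the same classes are served from the cache.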

Some other classic interpreter optimizations were not directly possible on top of the metacompilation systems. This includes indirect threading and top-of-stack caching for bytecode interpreters. Well, more precisely, we experimented a little with both, but they didn’t give the desired benefits and rather slowed the interpreters down, and we only made them work on a few benchmarks. Pushing this further will require extensive changes to the metacompilation systems as far as we can tell from talking to people working on GraalVM and RPython. So, future work…

With all these optimizations, we reached a point where adding further optimizations showed only minimal gains, typically specific to a benchmark, which gives us some confidence that our results are meaningful.

The final results are as follows:

Figure 2. Interpreter-only run-time performance of the Are We Fast Yet benchmarks, on a logarithmic scale, with Java 17 as baseline. While TSOM (TruffleSOM) and PySOM are overall slower than HotSpot's Java interpreter and Node.js/V8's Ignition interpreter, we can also observe that PySOM_AST and TSOM_AST are faster than the bytecode versions. TSOM_BC is the slowest interpreter overall.

Since we are confident that both types of interpreters are optimized roughly to the same level, we conclude that bytecode interpreters do not have their traditional advantage on top of metacompilation systems when it comes to pure interpreter speed.

Based on our current understanding, we attribute that to some general challenges metacompilation systems face when producing native code for the bytecode interpreter loops. GraalVM, for instance, uses the Graal IR, which only supports structured control flow. This means we cannot directly encode arbitrary jumps between bytecodes, and the compiler also struggles with the size of bytecode loops, leaving some optimization opportunities on the table. The bytecode loops of a standard interpreter can be multiple tens or hundreds of kilobytes of native code, whereas bytecode interpreters written in C/C++ are typically much more concise.

We also looked at memory use, and indeed, on that metric bytecode interpreters win compared to AST interpreters. However, the difference is not as stark as one might expect, and bytecode interpreters may also require more allocations based on how they are structured for boxing or other run-time data structures.

Though, this blog post is just a high-level overview. For the details, please see the paper below.

Recommendations: AST or Bytecode?

Based on our results, what would I do, if I had to implement another language on top of Truffle or RPython? Well, as always, it really depends… Let’s look at the two ends of the spectrum I can think of:

An Established, widely used Language: For this, I’d assume that a bytecode has been defined, and it is going to evolve slowly in the future. I’d also assume it has possibly large existing programs with a lot of code, i.e., hundreds of thousands of lines of code. For these types of languages, I’d suggest sticking with the bytecode. Bytecode is fast to load and lots of code is likely executed only once or not at all. This means the memory benefits of the compact representation likely outweigh other aspects, and we get decent performance.

A Completely new Language: Here I will assume the language will first need to find its way and its design may change frequently. On top of metacompilation systems, we can get really good performance, there’s not a lot of code for our language yet, and our users are unlikely to care too much about loading huge amounts of code. Here, AST interpreters in the Truffle style are likely the more flexible choice. You don’t need to design a bytecode, and can instead focus on the language and getting to acceptable performance first. Later, once you have larger code bases with their own performance challenges, you may still think about designing a bytecode, but I would think very few languages will ever get there.

For languages in between these two ends of the spectrum, one would probably want to weigh up engineering effort, which I’d think to be lower for AST interpreters, and memory use for large code bases, where bytecode interpreters are better.

AST vs. Bytecode: Interpreters in the Age of Meta-Compilation

Our paper includes more on the background, our interpreter implementations, and our experimental design. It also has a more in-depth analysis on the performance properties we observed in terms of run time and memory use. To guide the work of language implementers, we also looked at the various optimizations, to see which ones are most important to gain interpreter as well as just-in-time compiled performance.

As an artifact, our paper comes with a Docker image that includes all experiments and raw data, which hopefully enables others to reproduce our results and expand on or compare to our work. The Dockerfile itself is also on GitHub, where one can also find the latest versions of TruffleSOM and PySOM. Though these two don’t yet contain the same versions as the paper, and may evolve further.

The paper is the result of a collaboration with Octave, Humphrey, and Sophie.

Any comments or suggestions are also greatly appreciated perhaps on Twitter @smarr or Mastodon.

Abstract

Thanks to partial evaluation and meta-tracing, it became practical to build language implementations that reach state-of-the-art peak performance by implementing only an interpreter. Systems such as RPython and GraalVM provide components such as a garbage collector and just-in-time compiler in a language-agnostic manner, greatly reducing implementation effort. However, meta-compilation-based language implementations still need to improve further to reach the low memory use and fast warmup behavior that custom-built systems provide. A key element in this endeavor is interpreter performance. Folklore tells us that bytecode interpreters are superior to abstract-syntax-tree (AST) interpreters both in terms of memory use and run-time performance.

This work assesses the trade-offs between AST and bytecode interpreters to verify common assumptions and whether they hold in the context of meta-compilation systems. We implemented four interpreters, each an AST and a bytecode one using RPython and GraalVM. We keep the difference between the interpreters as small as feasible to be able to evaluate interpreter performance, peak performance, warmup, memory use, and the impact of individual optimizations.

Our results show that both systems indeed reach performance close to Node.js/V8. Looking at interpreter-only performance, our AST interpreters are on par with, or even slightly faster than their bytecode counterparts. After just-in-time compilation, the results are roughly on par. This means bytecode interpreters do not have their widely assumed performance advantage. However, we can confirm that bytecodes are more compact in memory than ASTs, which becomes relevant for larger applications. However, for smaller applications, we noticed that bytecode interpreters allocate more memory because boxing avoidance is not as applicable, and because the bytecode interpreter structure requires memory, e.g., for a reified stack.

Our results show AST interpreters to be competitive on top of meta-compilation systems. Together with possible engineering benefits, they should thus not be discounted so easily in favor of bytecode interpreters.

AST vs. Bytecode: Interpreters in the Age of Meta-Compilation
O. Larose, S. Kaleba, H. Burchell, S. Marr; Proceedings of the ACM on Programming Languages, OOPSLA'23, p. 318–346, ACM, 2023.
Paper: HTML, PDF
DOI: 10.1145/3622808
Appendix: online appendix

BibTex: bibtex

@article{Larose:2023:AstVsBc,
  abstract = {Thanks to partial evaluation and meta-tracing, it became practical to build language implementations that reach state-of-the-art peak performance by implementing only an interpreter. Systems such as RPython and GraalVM provide components such as a garbage collector and just-in-time compiler in a language-agnostic manner, greatly reducing implementation effort. However, meta-compilation-based language implementations still need to improve further to reach the low memory use and fast warmup behavior that custom-built systems provide. A key element in this endeavor is interpreter performance. Folklore tells us that bytecode interpreters are superior to abstract-syntax-tree (AST) interpreters both in terms of memory use and run-time performance.
  
  This work assesses the trade-offs between AST and bytecode interpreters to verify common assumptions and whether they hold in the context of meta-compilation systems. We implemented four interpreters, each an AST and a bytecode one using RPython and GraalVM. We keep the difference between the interpreters as small as feasible to be able to evaluate interpreter performance, peak performance, warmup, memory use, and the impact of individual optimizations.
  
  Our results show that both systems indeed reach performance close to Node.js/V8. Looking at interpreter-only performance, our AST interpreters are on par with, or even slightly faster than their bytecode counterparts. After just-in-time compilation, the results are roughly on par. This means bytecode interpreters do not have their widely assumed performance advantage. However, we can confirm that bytecodes are more compact in memory than ASTs, which becomes relevant for larger applications. However, for smaller applications, we noticed that bytecode interpreters allocate more memory because boxing avoidance is not as applicable, and because the bytecode interpreter structure requires memory, e.g., for a reified stack.
  
  Our results show AST interpreters to be competitive on top of meta-compilation systems. Together with possible engineering benefits, they should thus not be discounted so easily in favor of bytecode interpreters.},
  appendix = {https://doi.org/10.5281/zenodo.8147414},
  articleno = {233},
  author = {Larose, Octave and Kaleba, Sophie and Burchell, Humphrey and Marr, Stefan},
  blog = {https://stefan-marr.de/2023/10/ast-vs-bytecode-interpreters/},
  doi = {10.1145/3622808},
  html = {https://stefan-marr.de/papers/oopsla-larose-et-al-ast-vs-bytecode-interpreters-in-the-age-of-meta-compilation/},
  issn = {2475-1421},
  journal = {Proceedings of the ACM on Programming Languages},
  keywords = {AST Bytecode CaseStudy Comparison Interpreter JITCompilation MeMyPublication MetaTracing PartialEvaluation myown},
  month = oct,
  number = {OOPSLA2},
  numpages = {29},
  pages = {318--346},
  pdf = {https://stefan-marr.de/downloads/oopsla23-larose-et-al-ast-vs-bytecode-interpreters-in-the-age-of-meta-compilation.pdf},
  publisher = {{ACM}},
  series = {OOPSLA'23},
  title = {AST vs. Bytecode: Interpreters in the Age of Meta-Compilation},
  volume = {7},
  year = {2023},
  month_numeric = {10}
}

Don’t Blindly Trust Your Java Profiler!

2023-09-20T17:59:26+01:00

How do we know on what to focus our attention when trying to optimize the performance of a program? I suspect at least some of us will reach for sampling profilers. They keep the direct impact on the program execution low, and collect stack traces every so often during the program execution. This gives us an approximate view of where a program spends its time. Though, this approximation as it turns out can be surprisingly unreliable.

Humphrey started his research work wanting to make profilers produce more directly actionable suggestions where and how to optimize programs. Though, relatively quickly we noticed that sampling profilers are not only probabilistic as one would expect, but can give widely different results between runs, which do not necessarily converge with many runs either.

In 2010, Mytkowicz et al. identified safepoint bias as a key issue for sampling profilers for Java programs. Though, their results were not quite as bad as what we were seeing, so Humphrey started to design experiments to characterize the precision and accuracy of Java profilers in more detail.

How bad does it get?

Before getting started: we’re fully aware that this isn’t a new issue and there are quite a number of great and fairly technical blogs out there discussing a large range of issues, for instance here, here, here, and here. In our work, we will only look at fully deterministic and small pure Java benchmarks to get a better understanding of what the current situation is.

What’s the issue you may ask? Well, let’s look at an example. Figure 1 shows the profiling results of Java Flight Recorder over 30 runs on the DeltaBlue benchmark. We see 8 different methods being identified as hottest method indicated by the hatched bars in red.

Figure 1: Bar chart for top 15 methods in the DeltaBlue benchmark identified by Java Flight Recorder. A bar represents the average percentage of run time over 30 runs, and the error bars indicate the minimum and maximum values.

Of course, much of this could probably be explained with the non-determinism inherent to JVMs such as HotSpot: just-in-time compilation, parallel compilation, garbage collection, etc. However, we run each benchmark not only 30 times but also long enough to be fully compiled. So, we basically give the profiler and JVM a best-case scenario.¹ And what do we get as a result? No clear indication where to start optimizing our application. However, if we had looked at only a single profile, we may have started optimizing something that is rarely the bit of code the application spends most time on.

¹ At least to the degree that is practical. Though, benchmarking is hard, and there are many things going on in modern VMs. See also Tratt’s posts on the topic: 1, 2

Fortunately, this is indeed the worst case we found.

Overall, we looked at async-profiler, Honest Profiler, Java Flight Recorder, JProfiler, perf, and YourKit. Figure 2 shows box plots for each of these profilers to indicate the range of differences we found between the minimum and maximum run-time percentage reported for each method over all benchmarks. Thus, the median isn’t too bad, but each profiler shows cases where there is more than 15% difference between some of the runs.

Figure 2: An aggregate of the differences between the minimum and maximum run-time percentage per method over all benchmarks.
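The metric behind this plot, the difference between the minimum and maximum run-time percentage per method, can be sketched as follows. The data is entirely made up for illustration, the method names are placeholders, and treating a method that is missing from a profile as 0% is my assumption here, not necessarily what the paper does.

```python
# Made-up run-time percentages per method, one dict per profiling run.
runs = [
    {"methodA": 22.0, "methodB": 18.5, "methodC": 9.0},
    {"methodA": 12.5, "methodB": 25.0, "methodC": 9.5},
    {"methodA": 19.0, "methodB": 20.0, "methodC": 8.0},
]


def spread_per_method(runs):
    """Map each method to max - min of its percentage over all runs."""
    methods = set().union(*runs)
    spread = {}
    for method in sorted(methods):
        # Assumption: a method missing from a profile counts as 0%.
        values = [run.get(method, 0.0) for run in runs]
        spread[method] = max(values) - min(values)
    return spread


def hottest_methods(runs):
    """Return the method with the highest percentage in each run."""
    return [max(run, key=run.get) for run in runs]


spread = spread_per_method(runs)
hottest = hottest_methods(runs)
print(spread)
print(hottest)
```

In this toy data set, the hottest method already changes between runs, and the spread of a single method is several percentage points, which is the kind of variation the box plots aggregate over all benchmarks.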

The paper goes into much more detail analyzing the results by comparing profilers with themselves and among each other to be able to characterize accuracy and precision without knowing a ground truth. It also includes plots that show how the results are distributed for specific methods to identify possible sources of the observed variation.

So, for all the details, please see the paper linked below. Any pointers and suggestions are also greatly appreciated perhaps on Twitter @smarr or Mastodon.

Abstract

To identify optimisation opportunities, Java developers often use sampling profilers that attribute a percentage of run time to the methods of a program. Even so these profilers use sampling, are probabilistic in nature, and may suffer for instance from safepoint bias, they are normally considered to be relatively reliable. However, unreliable or inaccurate profiles may misdirect developers in their quest to resolve performance issues by not correctly identifying the program parts that would benefit most from optimisations.

With the wider adoption of profilers such as async-profiler and Honest Profiler, which are designed to avoid the safepoint bias, we wanted to investigate how precise and accurate Java sampling profilers are today. We investigate the precision, reliability, accuracy, and overhead of async-profiler, Honest Profiler, Java Flight Recorder, JProfiler, perf, and YourKit, which are all actively maintained. We assess them on the fully deterministic Are We Fast Yet benchmarks to have a stable foundation for the probabilistic profilers.

We find that profilers are relatively reliable over 30 runs and normally report the same hottest method. Unfortunately, this is not true for all benchmarks, which suggests their reliability may be application-specific. Different profilers also report different methods as hottest and cannot reliably agree on the set of top 5 hottest methods. On the positive side, the average run time overhead is in the range of 1% to 5.4% for the different profilers.

Future work should investigate how results can become more reliable, perhaps by reducing the observer effect of profilers by using optimisation decisions of unprofiled runs or by developing a principled approach of combining multiple profiles that explore different dynamic optimisations.

Don’t Trust Your Profiler: An Empirical Study on the Precision and Accuracy of Java Profilers
H. Burchell, O. Larose, S. Kaleba, S. Marr; In Proceedings of the 20th ACM SIGPLAN International Conference on Managed Programming Languages and Runtimes, MPLR'23, p. 1–14, ACM, 2023.
Paper: PDF
DOI: 10.1145/3617651.3622985
Appendix: online appendix

BibTex: bibtex

@inproceedings{Burchell:2023:Profilers,
  abstract = {To identify optimisation opportunities, Java developers often use sampling profilers that attribute a percentage of run time to the methods of a program. Even so these profilers use sampling, are probabilistic in nature, and may suffer for instance from safepoint bias, they are normally considered to be relatively reliable. However, unreliable or inaccurate profiles may misdirect developers in their quest to resolve performance issues by not correctly identifying the program parts that would benefit most from optimisations.
  
  With the wider adoption of profilers such as async-profiler and Honest Profiler, which are designed to avoid the safepoint bias, we wanted to investigate how precise and accurate Java sampling profilers are today. We investigate the precision, reliability, accuracy, and overhead of async-profiler, Honest Profiler, Java Flight Recorder, JProfiler, perf, and YourKit, which are all actively maintained. We assess them on the fully deterministic Are We Fast Yet benchmarks to have a stable foundation for the probabilistic profilers.
  
  We find that profilers are relatively reliable over 30 runs and normally report the same hottest method. Unfortunately, this is not true for all benchmarks, which suggests their reliability may be application-specific. Different profilers also report different methods as hottest and cannot reliably agree on the set of top 5 hottest methods. On the positive side, the average run time overhead is in the range of 1% to 5.4% for the different profilers.
  
  Future work should investigate how results can become more reliable, perhaps by reducing the observer effect of profilers by using optimisation decisions of unprofiled runs or by developing a principled approach of combining multiple profiles that explore different dynamic optimisations.},
  acceptancerate = {0.54},
  appendix = {https://github.com/HumphreyHCB/AWFY-Profilers},
  author = {Burchell, Humphrey and Larose, Octave and Kaleba, Sophie and Marr, Stefan},
  blog = {https://stefan-marr.de/2023/09/dont-blindly-trust-your-profiler/},
  booktitle = {Proceedings of the 20th ACM SIGPLAN International Conference on Managed Programming Languages and Runtimes},
  doi = {10.1145/3617651.3622985},
  keywords = {CPUSampling Comparison MeMyPublication Precision Profiling myown},
  month = oct,
  pages = {1--14},
  pdf = {https://stefan-marr.de/downloads/mplr23-burchell-et-al-dont-trust-your-profiler.pdf},
  publisher = {ACM},
  series = {MPLR'23},
  title = {{Don’t Trust Your Profiler: An Empirical Study on the Precision and Accuracy of Java Profilers}},
  year = {2023},
  month_numeric = {10}
}

An Introduction to Interpreters and JIT Compilation

2023-09-11T17:00:34+01:00

Last week, I gave two lectures at the Programming Language Implementation Summer School (PLISS). PLISS was very well organized and the students and other presenters made for a very enjoyable week of new ideas, learning, and discussing.

For my own lectures, I decided to take an approach that focused more on the high-level ideas and can introduce a wider audience to how we build interpreters and a range of techniques for just-in-time compilation.

Of course, I also wanted to talk a little bit about our own work. Thus, both lectures come with the strong bias of meta-compilation systems. My interpreter lecture is informed by our upcoming OOPSLA paper, which shows that in the context of meta-compilation systems, abstract-syntax-tree interpreters are doing surprisingly well compared to bytecode interpreters.

My lecture on just-in-time compilation of course also went into how meta-compilation works and how it enables us to build languages that can reach state-of-the-art performance by compiling a user program through our interpreters. While it’s still a lot of work, the big vision is that one day, we might just define the grammar, provide a few extra details of how the language is to be executed, and then some kind of toolchain gives us a language runtime that executes user programs with state-of-the-art performance.

One can still dream… 🤓

When preparing these lectures, I was also looking back at the lectures I gave in 2019 for a summer school at Dagstuhl. Perhaps, this material will at some point form its own course on Virtual Machines. Another of those dreams…

Lectures

I have to admit, the original abstracts don’t quite represent the final lectures. So, I’ll also include the outlines in addition to the slides.

Interpreters: Everywhere And All The Time

Implementers often start with an interpreter to sketch how a language may work. They are easy to implement and great to experiment with. However, they are also an essential part of dynamic language implementations. We will talk about the basics of abstract syntax trees, bytecodes, and how these ideas can be used to implement a language. We will also look into optimizations for interpreters: how AST and bytecode interpreters can use run-time feedback to improve performance, and discuss how super nodes and super instructions allow us to make effective use of modern CPUs.

Outline

How are programming languages implemented?
Types of interpreters
- abstract syntax tree
- bytecode
Interpreter optimizations
- Lookup caching
- AST/bytecode-level inlining
- Library lowering, library intrinsification
- Super nodes, super instructions
- Self-optimization, bytecode quickening

Slides

A Brief Introduction to Just-in-Time Compilation

Since the early days of object-oriented languages, run-time polymorphism has been a challenge for implementers. Smalltalk and Self pushed many ideas to an extreme, and their implementers had to invent techniques such as lookup caches, tracing and method-based compilation, deoptimization, and maps. While these ideas originated in the ’80s and ’90s, they are key ingredients of today’s just-in-time compilers for Java, Ruby, Python, and JavaScript.

Outline

Just-in-time compilation
- Basic assumptions and application behavior
- Selection of compilation units
- Executing Dynamic Languages
- Using Run-Time Feedback
- Metacompilation
Efficient Data Representation
- Maps, hidden classes, shapes
- Storage strategies
- Handling concurrency and parallelism

Slides

If you have any questions, I am more than happy to answer, possibly on Twitter @smarr or Mastodon.

Squeezing a Little More Performance Out of Bytecode Interpreters

2023-06-06T16:53:52+01:00

Earlier this year, Wanhong Huang, Tomoharu Ugawa, and I published some new experiments on interpreter performance. We experimented with a Genetic Algorithm to squeeze a little more performance out of bytecode interpreters. Since I have spent much of my research time looking for ways to improve interpreter performance, I was quite intrigued by the basic question behind Wanhong’s experiments: which is the best order of bytecode handlers in the interpreter loop?

The Basics: Bytecode Loops and Modern Processors

Let’s start with a bit of background. Many of today’s widely used interpreters use bytecodes, which represent a program as operations quite similar to processor instructions. Though, depending on the language we are trying to support in our interpreter, the bytecodes can be arbitrarily complex, in terms of how they encode arguments, but also in terms of the behavior they implement.

In the simplest case, we would end up with an interpreter loop that looks roughly like this:

uint8_t bytecodes[] = { push_local, push_local, add, /* ... */ };
size_t index = 0;
while (true) {
  // fetch the next bytecode and advance past it and its arguments
  uint8_t bytecode = bytecodes[index];
  index += /* ... */;
  switch (bytecode) {
    case push_local:
      // ...
      break;
    case add:
      // ...
      break;
    case call_method:
      // ...
      break;
  }
}

Listing 1: switch/case Interpreter Loop.
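Listing 1 leaves out the handler bodies. For readers who want something to run, here is a minimal sketch of the same loop structure, written in Ruby rather than C to keep it short. It assumes a tiny stack machine: the operand encoding (one operand after push_local), the HALT bytecode, and the run method are assumptions for illustration only, and call_method is left out.

```ruby
# Bytecode numbers; the concrete values are arbitrary and only for illustration.
PUSH_LOCAL = 0
ADD        = 1
HALT       = 2 # hypothetical, so that the loop can terminate

def run(bytecodes, locals)
  stack = []
  index = 0
  loop do
    bytecode = bytecodes[index]
    index += 1
    case bytecode # the single dispatch point of the loop
    when PUSH_LOCAL
      stack.push(locals[bytecodes[index]]) # operand: index of the local
      index += 1
    when ADD
      right = stack.pop
      left  = stack.pop
      stack.push(left + right)
    when HALT
      return stack.pop
    else
      raise "unknown bytecode #{bytecode}"
    end
  end
end

program = [PUSH_LOCAL, 0, PUSH_LOCAL, 1, ADD, HALT]
result = run(program, [3, 4])
puts result
```

The case expression plays the role of the switch: every bytecode, no matter which handler ran before, is dispatched from this one place.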

Here, push_local and add are much simpler than any call_method bytecode. Depending on the language that we try to implement, push_local is likely just a few processor instructions, while call_method might be significantly more complex, because it may need to lookup the method, ensure that arguments are passed correctly, and ensure that we have memory for local variables for the method that is to be executed. Since bytecodes can be arbitrarily complex, S. Brunthaler distinguished between high abstraction-level interpreters and low abstraction-level ones. High abstraction-level interpreters do not spend a lot of time on the bytecode dispatch, but low abstraction-level ones do, because their bytecodes are comparably simple, and have often just a handful of processor instructions. Thus, low abstraction-level interpreters would benefit most from optimizing the bytecode dispatch.

A classic optimization of the bytecode dispatch is threaded code interpretation, in which we represent a program not only using bytecodes, but with an additional array of jump addresses. This optimization is also often called direct threaded code. It is particularly beneficial for low abstraction-level interpreters, but is applied more widely.

// Note: &&label and goto *ptr are the "labels as values" extension
// supported by GCC and Clang; they are not standard C.
uint8_t bytecodes[] = { push_local, push_local, add, /* ... */ };
void* targets[] = { &&push_local, &&push_local, &&add, /* ... */ };
void* target;

push_local:
  // ...
  target = targets[index];
  index += /* ... */;
  goto *target;

add:
  // ...
  target = targets[index];
  index += /* ... */;
  goto *target;

call_method:
  // ...
  target = targets[index];
  index += /* ... */;
  goto *target;

Listing 2: Direct Threaded Interpreter.

With this interpreter optimization, we do not have the explicit while loop. Instead, we have goto labels for each bytecode handler and each handler has a separate copy of the dispatch code, that is the goto jump instruction.

This helps modern processors in at least two ways:

- it avoids an extra jump to the top of the loop at the end of each bytecode,
- and perhaps more importantly, we have multiple dispatch points instead of a single one.

This is important for branch prediction. In our first loop, a processor would not be able to predict where a jump is going, because the switch normally translates to a single jump that goes to most, if not all bytecodes. Though, when we have the jump at the end of a bytecode handler, we may only see a subset of bytecodes, which increases the chance that the processor can predict the jump correctly.
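To make the difference between one and many dispatch points a bit more tangible, here is a toy model in Ruby. It assumes a deliberately naive predictor that simply remembers the last target of each jump instruction. Real branch predictors are far more sophisticated, so this only illustrates the argument and says nothing about actual hardware. The bytecode trace, including the pop bytecode, is made up.

```ruby
# Counts how many indirect jumps a "predict the last target" scheme gets right.
def predicted_jumps(trace, per_handler_dispatch:)
  last_target = {}  # dispatch site => bytecode it jumped to last time
  site = :loop_top  # the switch has exactly one dispatch site
  hits = 0
  trace.each do |bytecode|
    hits += 1 if last_target[site] == bytecode
    last_target[site] = bytecode
    # With threaded code, the next jump is issued from this handler's own
    # copy of the dispatch code.
    site = bytecode if per_handler_dispatch
  end
  hits
end

trace = %i[push_local add call_method pop] * 100 # hypothetical hot loop
switch_hits   = predicted_jumps(trace, per_handler_dispatch: false)
threaded_hits = predicted_jumps(trace, per_handler_dispatch: true)
puts "switch: #{switch_hits} of #{trace.size}, threaded: #{threaded_hits} of #{trace.size}"
```

In this trace, every bytecode is always followed by the same successor. The single dispatch site of the switch sees all four targets in rotation, so remembering the last one never helps. With one dispatch site per handler, each site sees only one target and is predicted correctly once it has been seen.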

Unfortunately, modern processors are rather complex. They have limits for instance for how many jump targets they can remember. They may also end up combining the history for different jump instructions, perhaps because they use an associative cache based on the address of the jump instruction. They have various different caches, including the instruction cache, into which our interpreter loop ideally fits for best performance. And all these things may interact in unexpected ways.

For me, these things make it pretty hard to predict the performance of a bytecode loop for a complex language such as JavaScript or Ruby.

And if it’s too hard to understand, why not try some kind of machine learning? And that’s indeed what Wanhong and Tomoharu came up with.

Towards an Optimal Bytecode Handler Ordering

With all the complexity of modern processors, Wanhong and Tomoharu observed that changing the order of bytecode handlers can make a significant difference for the overall performance of an interpreter. Of course, this will only make a difference if our interpreter indeed spends a significant amount of time in the bytecode loop and the handler dispatch itself.

When looking at various interpreters, we will find most of them use a natural order. With that I mean, the bytecode handlers are in the order of the numbers assigned to each bytecode. Other possible orders could be a random order, or perhaps even an order based on the frequency of bytecodes or the frequency of bytecode sequences. Thus, one might simply first have the most frequently used bytecodes, and then less frequently used ones, perhaps hoping that this means the most used instructions fit into caches, or help the branch predictor in some way.

The goal of our experiments was to find out whether we can use a Genetic Algorithm to find a better ordering so that we can improve interpreter performance. We use a Genetic Algorithm to create new orders of bytecode handlers: a crossover combines two existing orderings, and a few handlers are additionally reordered, which adds mutation to the new order. The resulting bytecode handler order is then compiled into a new interpreter, for which we measure the run time of a benchmark. With a Genetic Algorithm, one can thus generate variations of handler orders that, over multiple generations of crossover and mutation, may evolve into a faster handler order.
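As a rough illustration of the crossover and mutation steps on handler orders, here is a small sketch in Ruby. It uses a generic order crossover and a swap mutation, which are not necessarily the operators used in the paper. The handler names, the HOT_PAIRS table, and the cost function are all assumptions: cost merely stands in for "compile an interpreter with this order and measure a benchmark".

```ruby
# Hypothetical bytecode handlers in their "natural" order.
HANDLERS = %i[push_local push_field add sub jump call_method return_value pop].freeze

# Stand-in for a real measurement: pretend that handlers which often follow
# each other should be placed close together.
HOT_PAIRS = [[:push_local, :add], [:add, :jump], [:call_method, :return_value]].freeze

def cost(order)
  position = order.each_with_index.to_h
  HOT_PAIRS.sum { |a, b| (position[a] - position[b]).abs }
end

# Order crossover: keep a slice of parent A in place and fill the remaining
# positions with the other handlers in the order they have in parent B.
def crossover(parent_a, parent_b, rng)
  from, to = [rng.rand(parent_a.size), rng.rand(parent_a.size)].minmax
  slice = parent_a[from..to]
  rest = parent_b - slice
  rest.take(from) + slice + rest.drop(from)
end

# Mutation: swap two randomly chosen handlers.
def mutate(order, rng)
  i = rng.rand(order.size)
  j = rng.rand(order.size)
  mutated = order.dup
  mutated[i], mutated[j] = mutated[j], mutated[i]
  mutated
end

def evolve(generations:, population_size:, rng:)
  population = Array.new(population_size) { HANDLERS.shuffle(random: rng) }
  generations.times do
    # Keep the better half, so the best order found so far is never lost.
    parents = population.min_by(population_size / 2) { |order| cost(order) }
    children = Array.new(population_size - parents.size) do
      child = crossover(parents.sample(random: rng), parents.sample(random: rng), rng)
      mutate(child, rng)
    end
    population = parents + children
  end
  population.min_by { |order| cost(order) }
end

best = evolve(generations: 30, population_size: 20, rng: Random.new(42))
puts "natural order cost: #{cost(HANDLERS)}, evolved order cost: #{cost(best)}"
puts best.inspect
```

In the real setting, evaluating one candidate means building and benchmarking an interpreter, so the cost of each generation is dominated by measurement time rather than by the algorithm itself.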

I’ll skip the details here, but please check out the paper below for the specifics.

Results on a JavaScript Interpreter

So, how well does this approach work? To find out, we applied it to the eJSVM, a JavaScript interpreter that is designed for resource-constrained devices.

In the context of resource-constrained embedded devices, it may make sense to tailor an interpreter to a specific application to gain the best performance. Thus, we started by optimizing the interpreter for a specific benchmark on a specific machine. To keep the time needed for the experiments manageable, we used three Intel machines and one Raspberry Pi with an ARM CPU. In many ways, optimizing for a specific benchmark is the best-case scenario, which is only practical if we can deploy a specific application together with the interpreter. Figure 1 shows the results on benchmarks from the Are We Fast Yet benchmark suite. We can see surprisingly large improvements. While the results depend very much on the processor architecture, every single benchmark sees an improvement on all platforms.

(a) Intel Core i9-12900 with GCC 10.3

(b) Intel Core i7-11700 with GCC 9.4

(d) ARM Cortex A53 with GCC 10.2.1

Fig. 1: Speedup over the baseline interpreter after optimizing it for the given benchmark, on the given processor. The results suggest a strong influence of the processor architecture and benchmark. On the Intel Xeon, we see a speedup of up to 23% on a specific benchmark and that every single benchmark can benefit.

Unfortunately, we can’t really know which programs users will run on our interpreters in all scenarios. Thus, we also looked at how well an interpreter trained on a single benchmark performs on the other benchmarks. Figure 2 shows how the performance changes when we train the interpreter for a specific benchmark. In the top left corner, we see the results when training for the Bounce benchmark. While Bounce itself sees a 7.5% speedup, the same interpreter speeds up the List benchmark by more than 12%. Training the interpreter on the Permute benchmark, however, gives much smaller improvements for the other benchmarks.

Fig. 2: Speedup of benchmarks on interpreters trained for a specific benchmark on the Intel Xeon W-2235 with GCC 9.4. The gains here depend strongly on the benchmark. While training with CD or Queens gives an average speedup of more than 10% across all benchmarks, training on Permute only gives about 3%.

In the paper, we look at a few more aspects including which Genetic Algorithm works best and how portable performance is between architectures.

Optimizing Bytecode Handler Order for other Interpreters

Reading this blog post, you may wonder how to best go about experimenting with your own interpreter. We also briefly tried optimizing CRuby. Unfortunately, we have not yet found the time to continue, but we found a few things that one needs to watch out for when doing so.

First, you may have noticed that we used relatively old versions of GCC. For eJSVM, these gave good results and did not interfere with our reordering. However, on CRuby and with newer GCCs, the compiler will start to reorder basic blocks itself, which makes it harder to get the desired results. Here, flags such as -fno-reorder-blocks or -fno-reorder-blocks-and-partition may be needed. Clang didn’t seem to reorder basic blocks in the interpreter loop.

As a basic test of how big a performance impact might be, I simply ran a handful of random bytecode handler orders, which I would normally expect to show some performance difference, likely a slowdown. Though, for CRuby I did not see a notable performance change, which suggests that bytecode dispatch may not be worth optimizing further. But it’s a bit early to tell conclusively at this point. We should give CPython and others a go, but haven’t gotten around to it just yet.

Conclusion

If you care about interpreter performance, it may be worth taking a look at the interpreter loop to see whether modern processors deliver better performance when bytecode handlers get reordered.

Our results suggest that it can give large improvements when training for a specific benchmark. There is also still a benefit for other benchmarks that we did not train for, though, it depends on how similar the training benchmark is to the others.

For more details, please read the paper linked below, or reach out on Twitter @smarr.

Abstract

Interpreter performance remains important today. Interpreters are needed in resource constrained systems, and even in systems with just-in-time compilers, they are crucial during warm up. A common form of interpreters is a bytecode interpreter, where the interpreter executes bytecode instructions one by one. Each bytecode is executed by the corresponding bytecode handler.

In this paper, we show that the order of the bytecode handlers in the interpreter source code affects the execution performance of programs on the interpreter. On the basis of this observation, we propose a genetic algorithm (GA) approach to find an approximately optimal order. In our GA approach, we find an order optimized for a specific benchmark program and a specific CPU.

We evaluated the effectiveness of our approach on various models of CPUs including x86 processors and an ARM processor. The order found using GA improved the execution speed of the program for which the order was optimized between 0.8% and 23.0% with 7.7% on average. We also assess the cross-benchmark and cross-machine performance of the GA-found order. Some orders showed good generalizability across benchmarks, speeding up all benchmark programs. However, the solutions do not generalize across different machines, indicating that they are highly specific to a microarchitecture.

Optimizing the Order of Bytecode Handlers in Interpreters using a Genetic Algorithm
W. Huang, S. Marr, T. Ugawa; In The 38th ACM/SIGAPP Symposium on Applied Computing (SAC '23), SAC'23, p. 10, ACM, 2023.
Paper: PDF
DOI: 10.1145/3555776.3577712

BibTex: bibtex

@inproceedings{Huang:2023:GA,
  abstract = {Interpreter performance remains important today. Interpreters are needed in
  resource constrained systems, and even in systems with just-in-time compilers,
  they are crucial during warm up. A common form of interpreters is a bytecode
  interpreter, where the interpreter executes bytecode instructions one by one.
  Each bytecode is executed by the corresponding bytecode handler.
  
  In this paper, we show that the order of the bytecode handlers in the
  interpreter source code affects the execution performance of programs on the
  interpreter. On the basis of this observation, we propose a genetic algorithm
  (GA) approach to find an approximately optimal order. In our GA approach, we
  find an order optimized for a specific benchmark program and a specific CPU.
  
  We evaluated the effectiveness of our approach on various models of CPUs
  including x86 processors and an ARM processor. The order found using GA
  improved the execution speed of the program for which the order was optimized
  between 0.8% and 23.0% with 7.7% on average. We also assess the cross-benchmark
  and cross-machine performance of the GA-found order. Some orders showed good
  generalizability across benchmarks, speeding up all benchmark programs.
  However, the solutions do not generalize across different machines, indicating
  that they are highly specific to a microarchitecture.},
  author = {Huang, Wanhong and Marr, Stefan and Ugawa, Tomoharu},
  blog = {https://stefan-marr.de/2023/06/squeezing-a-little-more-performance-out-of-bytecode-interpreters/},
  booktitle = {The 38th ACM/SIGAPP Symposium on Applied Computing (SAC '23)},
  doi = {10.1145/3555776.3577712},
  isbn = {978-1-4503-9517-5/23/03},
  keywords = {Bytecodes CodeLayout EmbeddedSystems GeneticAlgorithm Interpreter JavaScript MeMyPublication Optimization myown},
  month = mar,
  pages = {10},
  pdf = {https://stefan-marr.de/downloads/acmsac23-huang-et-al-optimizing-the-order-of-bytecode-handlers-in-interpreters-using-a-genetic-algorithm.pdf},
  publisher = {ACM},
  series = {SAC'23},
  title = {{Optimizing the Order of Bytecode Handlers in Interpreters using a Genetic Algorithm}},
  year = {2023},
  month_numeric = {3}
}

How Effective are Classic Lookup Optimizations for Rails Apps?

2022-11-08T16:39:52+00:00

We know that Ruby and especially Rails applications can be very dynamic and pretty large. Though, many of the optimizations interpreters and even just-in-time compilers use were invented in the 1980s and 1990s, before Ruby and Rails even existed. So, I was wondering: do these optimizations still have a chance of coping with the millions of lines of Ruby code that large Rails apps from Shopify, Stripe, or GitLab have? Unfortunately, we don’t have access to such applications. As the next best thing, we took the largest Ruby benchmarks we could get our hands on, and analyzed those.

As part of her research, Sophie wrote a paper investigating the behavior of method call sites in detail. She looked at how well optimizations such as lookup caches, target duplicate elimination, and splitting apply to modern Ruby code. I’ll use the work here as a foundation and zoom in on the Rails apps we looked at. For all details including the measurement methodology, I’ll defer to sec. 3 of the paper. It also discusses how Sophie instrumented TruffleRuby and how the data was processed.

BlogRails, ERubiRails, the Liquid Benchmarks

The benchmarks I am going to be focusing on are called BlogRails, ERubiRails, LiquidCartRender, and LiquidRenderBibs. BlogRails, usually referred to as railsbench, is a small Ruby on Rails application, simulating a basic blog, as created by Rails’ scaffold generator. The benchmark accesses existing blog posts and creates new ones. ERubiRails is a similarly small Rails app and renders an ERB template from the Discourse project.

I also included two Liquid template language benchmarks here out of curiosity. LiquidCartRender uses Liquid to render an HTML page for a shopping cart. LiquidRenderBibs renders an HTML page with a list of papers that have a variety of different data bits to be shown (specifically this one here).

Benchmark	Statements	Statement Coverage	Functions	Function Coverage	Calls (in 1000)	Poly. and Megamorphic Calls	Used Call Sites	Poly. and Megamorphic Call Sites
BlogRails	118,717	48%	37,595	38%	13,863	7.4%	52,361	2.3%
ERubiRails	117,922	45%	37,328	35%	12,309	5.4%	47,794	2.3%
LiquidCartRender	23,562	39%	6,269	30%	236	5.5%	3,581	2.4%
LiquidRenderBibs	23,277	39%	6,185	29%	385	23.4%	3,466	2.8%

As the table above shows, the Rails benchmarks have about 120,000 Ruby statements each, of which 45-48% are executed. Of the circa 37,500 functions, about 35-38% are executed. In total, the BlogRails benchmark makes about 13,863,000 function calls. 7.4% of these calls are polymorphic or megamorphic.

In Ruby, a call site is considered to be monomorphic if a single receiver class is seen during execution, which also means there’s usually a single method that is being called. When there is more than one receiver type, we call the call site polymorphic. Once there are more than a certain number of receiver types, a call site is megamorphic. In TruffleRuby, this happens when more than 8 different receiver types were used at the call site. Though, this is a bit of a simplification, and we’ll get into more details in the next section.
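Expressed as code, the simplified classification could look roughly like the following sketch. The classify method, the :unused label, and the example call sites are hypothetical; only the limit of 8 is taken from the description of TruffleRuby above.

```ruby
MEGAMORPHIC_LIMIT = 8 # more than 8 receiver types means megamorphic

# Classifies a call site based on the receiver classes observed at it.
def classify(receiver_classes)
  case receiver_classes.uniq.size
  when 0 then :unused
  when 1 then :monomorphic
  when 2..MEGAMORPHIC_LIMIT then :polymorphic
  else :megamorphic
  end
end

# Receiver classes seen at three made-up call sites.
site_a = classify([String, String, String])
site_b = classify([Integer, Float, Integer])
site_c = classify([Integer, Float, String, Symbol, Array, Hash, Range, NilClass, Proc])
puts [site_a, site_b, site_c].inspect
```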

Until then, we can observe that ERubiRails seems a bit less polymorphic. Only 5.4% of its calls are polymorphic or megamorphic.

The Liquid benchmarks are much smaller, with only about 23,500 statements in about 6,200 functions. The number of calls being between 236,000 and 385,000 is also significantly smaller. Surprisingly, about 23% of all calls in the LiquidRenderBibs benchmark are polymorphic. While I haven’t looked into it in more detail, I would assume that this might be an artifact of the template having to handle a large number of differences in the input data.

Compared to other languages, these numbers do not feel too different. For instance, the Dacapo Con Scala project found somewhat similar numbers for Java and Scala. In the Scala benchmarks they looked at, 89.7% of all calls were monomorphic. The Java benchmarks had about 91.5% of all calls being monomorphic.

This means what we see for Rails is roughly in line with what one would expect. This is good news, because it means that the classic optimizations are likely going to work as expected.

But before getting too enthusiastic, let’s dig a little deeper to see whether that is indeed the case.

Receiver versus Target Polymorphism

Let’s take a very simple, Rails-like example as a starting point. The following code shows the ApplicationController defining the status method, which simply returns an HTTP status code.

We also define an ArticlesController as subclass of ApplicationController. The ArticlesController implements the index method, which for brevity is kept empty.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
class ApplicationController
  def status
    200
  end
end

class ArticlesController < ApplicationController
  def index
  end
end

controllers = [
  ArticlesController.new,
  ApplicationController.new
]
controllers.select { |c| c.status == 200 }

At the end of the example on line 16, we have an array with both controllers, and select the ones with the status code being 200. The call to the status method is receiver-polymorphic. This means the call site sees multiple different receiver types, in our case the two controllers. Though, at the same time, the call site is target-monomorphic. This means, there’s only a single method that is activated.
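The distinction can be checked directly in plain Ruby. The following sketch repeats the two classes so that it is self-contained, and uses Method#owner to ask where the status method that each receiver ends up calling is defined. The variable names are just for illustration.

```ruby
class ApplicationController
  def status
    200
  end
end

class ArticlesController < ApplicationController
  def index
  end
end

controllers = [ArticlesController.new, ApplicationController.new]

# Two different receiver classes reach the call site...
receiver_classes = controllers.map(&:class).uniq
# ...but for both of them, the lookup finds the method of the same class.
target_owners = controllers.map { |c| c.method(:status).owner }.uniq

puts "receiver classes: #{receiver_classes.inspect}"
puts "status is defined in: #{target_owners.inspect}"
```

Two receiver classes but a single owner is exactly the receiver-polymorphic, target-monomorphic situation.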

TruffleRuby optimizes this case by using two polymorphic inline caches or more accurately dispatch chains, one after another, as depicted in fig. 1.

Fig. 1. Optimizing Method Dispatch with two Consecutive Dispatch Chains to Eliminate Duplicate Targets.

By using two dispatch chains, a language implementation can often turn a receiver-polymorphic call site into a target-monomorphic one. The first dispatch chain acts as a classic lookup cache. It takes the receiver type¹ and caches the method as the result of a lookup. The second cache deduplicates the target methods and, in the case of TruffleRuby, it caches Truffle’s call nodes, which implement the method activation, but also optimizations such as splitting.

¹ Since TruffleRuby uses object shapes to optimize objects, they would be used as a proxy for the receiver type.
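A very reduced model of the two consecutive caches might look like the sketch below. It is plain Ruby and not how TruffleRuby implements it: CallSite, CallNode, and CommentsController are hypothetical names, hashes stand in for the dispatch chains, and receiver classes are used instead of object shapes.

```ruby
class ApplicationController
  def status
    200
  end
end

class ArticlesController < ApplicationController; end
class CommentsController < ApplicationController; end # hypothetical third receiver

# Stand-in for a call node: there is one per distinct target method.
CallNode = Struct.new(:target) do
  def call(receiver, *args)
    target.bind(receiver).call(*args)
  end
end

class CallSite
  attr_reader :lookup_cache, :call_nodes

  def initialize(selector)
    @selector = selector
    @lookup_cache = {} # first chain: receiver class => target method
    @call_nodes = {}   # second chain: target method => call node
  end

  def call(receiver, *args)
    # First chain: one entry per receiver class, caching the lookup result.
    target = (@lookup_cache[receiver.class] ||= lookup(receiver.class))
    # Second chain: one entry per distinct target, shared by all receivers.
    # Identifying a target by owner and name is a simplification.
    node = (@call_nodes[[target.owner, target.name]] ||= CallNode.new(target))
    node.call(receiver, *args)
  end

  private

  def lookup(klass)
    found = klass.instance_method(@selector)
    # Fetch the method from its defining class, so that the result is the
    # same for all subclasses and can be bound to any of their instances.
    found.owner.instance_method(found.name)
  end
end

site = CallSite.new(:status)
receivers = [ArticlesController.new, ApplicationController.new, CommentsController.new]
statuses = receivers.map { |c| site.call(c) }
puts "lookup cache entries: #{site.lookup_cache.size}, call nodes: #{site.call_nodes.size}"
```

The first cache still grows with every new receiver class, but the second one ends up with a single call node, which is the part a compiler would care about for inlining.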

Based on our data, eliminating duplicate targets is also an effective optimization for Rails:

	Number of Calls		After Eliminating Duplicate Targets
Benchmark	Polymorphic	Megamorphic	Polymorphic	Megamorphic
BlogRails	956,515	63,319	-48.8%	-99.1%
ERubiRails	626,535	40,699	-37.4%	-98.6%
LiquidCartRender	12,598	280	-73.3%	-100.0%
LiquidRenderBibs	89,866	280	-73.7%	-100.0%

The table above gives the absolute number of calls these benchmarks do. As we can see in columns two and three, there are relatively few megamorphic calls to begin with. In TruffleRuby, a call site is megamorphic when there are more than 8 different receivers or target methods. Megamorphic calls can be a performance issue, especially when class hierarchies are deep and method lookup is costly, because for such megamorphic calls, we cannot cache the lookup results.

The good news is that eliminating duplicate targets is highly effective in avoiding megamorphic calls. As we can see in column five, most calls stop being megamorphic. However, for the Rails benchmarks, the optimization is much less effective in avoiding polymorphic calls, reducing their number by only 37.4-48.8%. This means that about 51-63% of these calls are still polymorphic.

For a basic interpreter, this isn’t too bad, because we still avoid the overhead of a lookup. However, for TruffleRuby with its just-in-time compilation, this situation is not ideal, because method inlining, i.e., replacing a call by taking the method body and integrating it into the caller during compilation, is limited.

On the positive side, our Liquid render benchmarks benefit nicely here. While I haven’t looked in detail, the number of megamorphic calls being the same suggests that these calls are made in the initial setup and eliminating duplicate targets prevents them from being megamorphic.

Method Splitting

TruffleRuby uses an optimization that is not as common in just-in-time compiling systems: method splitting. Most just-in-time compilers rely solely on inlining to enable classic compiler optimizations and get efficient native code. Though, since TruffleRuby builds on the Truffle framework with its metacompilation approach, it tries harder to optimize even before the just-in-time compilation kicks in.

Truffle’s method splitting copies a method in an uninitialized state. Most importantly for us, this means it copies the method without the lookup cache entries, as illustrated in fig. 2. The split method, i.e. the copy, is then associated with a single call site. The idea is that this copy specializes itself in the context of this single caller, which hopefully means method calls are more likely to be monomorphic.

Fig. 2. Method Splitting copies a method for use at a specific call site. The copy is uninitialized, which means the dispatch chains do not contain any entries yet.
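The effect of splitting on a lookup cache can be mimicked with a few lines of Ruby. This is a toy model under simplifying assumptions, not Truffle’s mechanism: MethodBody is a hypothetical class that represents a method containing a single inner call to to_s, and its split method just creates a copy with an empty cache.

```ruby
# Hypothetical model of a method whose body contains one inner call site,
# roughly `def render(obj) = obj.to_s`.
class MethodBody
  attr_reader :inner_cache

  def initialize
    @inner_cache = {} # receiver class => looked-up method, initially empty
  end

  def execute(arg)
    target = (@inner_cache[arg.class] ||= arg.class.instance_method(:to_s))
    target.bind(arg).call
  end

  # Splitting: a fresh, uninitialized copy without any cache entries.
  def split
    MethodBody.new
  end
end

# Without splitting, two callers with different argument types share one
# method body, and its inner call site becomes polymorphic.
shared = MethodBody.new
shared.execute(42)      # caller A always passes integers
shared.execute(:answer) # caller B always passes symbols

# With splitting, each caller gets its own copy, which then only sees the
# types of that one caller.
copy_for_a = shared.split
copy_for_b = shared.split
copy_for_a.execute(42)
copy_for_b.execute(:answer)

puts "shared: #{shared.inner_cache.size} entries"
puts "split copies: #{copy_for_a.inner_cache.size} and #{copy_for_b.inner_cache.size} entries"
```

The price is the extra copies, which is also why splitting too eagerly can cause redundant work at run time.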

So, is splitting succeeding at monomorphizing call sites? Let’s look at the data. Note, we already eliminated duplicate targets. Thus, the numbers are a little smaller here.

	Number of Calls (w/o duplicate targets)		After Splitting
Benchmark	Polymorphic	Megamorphic	Polymorphic	Megamorphic
BlogRails	490,072	557	-100%	-100%
ERubiRails	391,997	553	-100%	-100%
LiquidCartRender	2,000	0	-100%	n/a
LiquidRenderBibs	23,633	0	-100%	n/a

Indeed, splitting is highly effective in turning polymorphic and megamorphic calls into monomorphic calls, which allows the just-in-time compiler to aggressively inline and optimize the Ruby code.

Overall Monomorphization

As we have seen in the last table, the polymorphic and megamorphic calls were all monomorphized. Though, let’s take a slightly different look at the data. Instead of looking at the run-time calls, let’s look at how many targets there are at a call site.

	Maximum Number of Targets
Benchmark	before optimizations	target duplicates eliminated	after splitting
BlogRails	206	24	2
ERubiRails	206	24	2
LiquidCartRender	20	5	1
LiquidRenderBibs	20	7	1

From this table, we can see that the Rails benchmarks have at least one call site with 206 different receiver types. After eliminating duplicate target methods, we see at most 24 different targets. Adding splitting to the system reduces it further to at most 2 entries. As we saw from the number of run-time calls, these optimizations in combination are indeed highly effective for Rails applications.

From this brief look, we can conclude that despite these optimizations having been invented some 30-40 years ago, they are still highly effective even for today’s dynamic Ruby on Rails systems.

Who You Gonna Call: Analyzing the Run-time Call-Site Behavior of Ruby Applications

In our paper, we go into many more details. We also look at how blocks behave (spoiler: they turn out to be slightly more polymorphic than methods). We investigate how lookup caches evolve over time and find patterns that may help us to improve performance further in the future. We also noticed that TruffleRuby’s splitting is a little too enthusiastic. For instance, blocks/Procs were always split, which has been fixed already. There is more work to be done to see whether splitting can further be reduced to avoid redundant work at run time. Though, that’s for another day.

Meanwhile, please give the paper a read, attend our presentation at DLS, and find us with questions, comments, and suggestions on Twitter @SophieKaleba and @smarr.

Abstract

Applications written in dynamic languages are becoming larger and larger and companies increasingly use multi-million line codebases in production. At the same time, dynamic languages rely heavily on dynamic optimizations, particularly those that reduce the overhead of method calls.

In this work, we study the call-site behavior of Ruby benchmarks that are being used to guide the development of upcoming Ruby implementations such as TruffleRuby and YJIT. We study the interaction of call-site lookup caches, method splitting, and elimination of duplicate call-targets.

We find that these optimizations are indeed highly effective on both smaller and large benchmarks, methods and closures alike, and help to open up opportunities for further optimizations such as inlining. However, we show that TruffleRuby’s splitting may be applied too aggressively on already-monomorphic call-sites, coming at a run-time cost. We also find three distinct patterns in the evolution of call-site behavior over time, which may help to guide novel optimizations. We believe that our results may support language implementers in optimizing runtime systems for large codebases built in dynamic languages.

Who You Gonna Call: Analyzing the Run-time Call-Site Behavior of Ruby Applications
S. Kaleba, O. Larose, R. Jones, S. Marr; In Proceedings of the 18th Symposium on Dynamic Languages, DLS'22, p. 14, ACM, 2022.
Paper: PDF
DOI: 10.1145/3563834.3567538

BibTex: bibtex

@inproceedings{Kaleba:2022:CallSites,
  abstract = {Applications written in dynamic languages are becoming larger and larger and companies increasingly use multi-million line codebases in production. At the same time, dynamic languages rely heavily on dynamic optimizations, particularly those that reduce the overhead of method calls.
  
  In this work, we study the call-site behavior of Ruby benchmarks that are being used to guide the development of upcoming Ruby implementations such as TruffleRuby and YJIT. We study the interaction of call-site lookup caches, method splitting, and elimination of duplicate call-targets.
  
  We find that these optimizations are indeed highly effective on both smaller and large benchmarks, methods and closures alike, and help to open up opportunities for further optimizations such as inlining. However, we show that TruffleRuby's splitting may be applied too aggressively on already-monomorphic call-sites, coming at a run-time cost. We also find three distinct patterns in the evolution of call-site behavior over time, which may help to guide novel optimizations. We believe that our results may support language implementers in optimizing runtime systems for large codebases built in dynamic languages.},
  acceptancerate = {0.4},
  author = {Kaleba, Sophie and Larose, Octave and Jones, Richard and Marr, Stefan},
  blog = {https://stefan-marr.de/2022/11/how-effective-are-classic-lookup-optimizations-for-rails-apps/},
  booktitle = {Proceedings of the 18th Symposium on Dynamic Languages},
  day = {7},
  doi = {10.1145/3563834.3567538},
  keywords = {Analysis CallSite DynamicLanguages Inlining LookupCache MeMyPublication Splitting myown},
  location = {Auckland, New Zealand},
  month = dec,
  note = {(acceptance rate 40%)},
  pages = {14},
  pdf = {https://stefan-marr.de/downloads/dls22-kaleba-et-al-analyzing-the-run-time-call-site-behavior-of-ruby-applications.pdf},
  publisher = {ACM},
  series = {DLS'22},
  title = {Who You Gonna Call: Analyzing the Run-time Call-Site Behavior of Ruby Applications},
  year = {2022},
  month_numeric = {12}
}

Acknowledgments

Thanks to Sophie, Octave, and Chris Seaton for suggestions and corrections on this blog post.

Reducing Memory Footprint by Minimizing Hidden Class Graphs

2022-10-30T12:45:51+00:00

Tomoharu noticed in his work on the eJSVM, a JavaScript virtual machine for embedded systems, that quite a bit of memory is needed for the data that helps us to represent JavaScript objects efficiently. So, we started to look into how the memory use could be reduced without sacrificing performance.

Objects in Dynamic Languages

In languages like JavaScript, Python, and Ruby, objects are much more flexible than in many other languages including Java, C++, or C#, because fields can be added and possibly even removed dynamically as needed.

I’ll use JavaScript for our examples. Let’s imagine we are working with a sensor that can determine location and movement information. Not all data may be available at every point when we access the sensor. Indeed, most likely we just have the current longitude and latitude. Perhaps sometimes, we have access to precise GPS coordinates, which also give us altitude. In even rarer cases, the sensor might be moving, which gives us a bearing and speed. If we imagine this in code, we might get something that looks vaguely like the following.

 1  let location = {}
 2  location.longitude = 51.28;  // getLong()
 3  location.latitude  =  1.08;  // getLat()
 4
 5  if (hasAltitude()) {
 6    location.altitude = getAltitude();
 7  }
 8
 9  if (isMoving()) {
10    location.bearing = getBearing();
11    location.speed   = getSpeed();
12  }

Thus, when we access the sensor, we create a new JavaScript object and then add the longitude and latitude fields. Depending on the available data, we may still add the field for altitude as well as bearing and speed. However, if the data is not available, our JavaScript object will only have longitude and latitude.

In addition to adding fields arbitrarily, in JavaScript, we can even delete them. So, how do our modern language implementations implement this efficiently?

Hidden Class: Finding Structure in Dynamic Programs

The most direct way to implement objects, where one can add and remove fields arbitrarily, would probably be some kind of hash table or a list of field names and their values. However, neither of these two approaches is as efficient as directly accessing a known memory offset for a specific field, as is possible in languages with less flexible objects.

To gain the same efficiency in dynamic languages, hidden classes were invented. The key idea is that a language implementation can determine at run time the structure of objects, create a kind of map or hidden class that can tell us which fields an object has, possibly even the types stored in a field, and most importantly where the field can be found in memory. This works well because most code is much less dynamic than what the language would allow for.

For our example code from above, fig. 1 shows what this may look like in an implementation. We start out with our location object being empty. The object only contains a pointer to an empty hidden class. Once we execute the code on line 2, the longitude field is added. We store the value 51.28 into an array that we use to store all field values. Since it’s the first field, it’s stored at index 0, and we record this in the hidden class. However, we really want to be able to reuse hidden classes easily. So, instead of changing the existing hidden class, we create a new one, which records longitude being stored at index 0.

Basically the same happens when we execute line 3 and add the latitude to the object. We need to expand the array by one slot to hold the value 1.08, and create a new hidden class that records that latitude is stored at index 1.

The next time we execute those lines, we don’t need to create new hidden classes, but can look them up based on the fields that we are adding.

Fig. 1. When a field is added to an object, the object needs to change the hidden class, which says where the field is to be stored. We may also need to expand the array to have enough space to store the field's value.
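To make the mechanics a bit more concrete, here is a minimal sketch of hidden classes with cached transitions. The names `HiddenClass`, `DynObject`, `withField`, and `EMPTY` are made up for this illustration and are not eJSVM's actual data structures; a real VM would use a far more compact representation.

```javascript
// A hidden class maps field names to indexes and remembers its successors.
class HiddenClass {
  constructor(fieldIndex = new Map()) {
    this.fieldIndex = fieldIndex;  // field name -> index in the value array
    this.transitions = new Map();  // field name -> successor hidden class
  }

  // Return the hidden class reached by adding `name`, creating it only once.
  withField(name) {
    let next = this.transitions.get(name);
    if (next === undefined) {
      const fields = new Map(this.fieldIndex);
      fields.set(name, fields.size); // the new field gets the next free index
      next = new HiddenClass(fields);
      this.transitions.set(name, next);
    }
    return next;
  }
}

const EMPTY = new HiddenClass();

// An object is just a pointer to a hidden class plus an array of values.
class DynObject {
  constructor() {
    this.hiddenClass = EMPTY;
    this.values = [];
  }

  set(name, value) {
    let index = this.hiddenClass.fieldIndex.get(name);
    if (index === undefined) {
      // Unknown field: move to the successor hidden class.
      this.hiddenClass = this.hiddenClass.withField(name);
      index = this.hiddenClass.fieldIndex.get(name);
    }
    this.values[index] = value; // grows the array by one slot if needed
  }

  get(name) {
    const index = this.hiddenClass.fieldIndex.get(name);
    return index === undefined ? undefined : this.values[index];
  }
}

const a = new DynObject();
a.set("longitude", 51.28);
a.set("latitude", 1.08);

const b = new DynObject();
b.set("longitude", 0.5);
b.set("latitude", 2.5);

// The second object reuses the hidden classes created for the first one.
console.log(a.hiddenClass === b.hiddenClass);
console.log(a.hiddenClass.fieldIndex.get("latitude"));
```

The first object creates two new hidden classes, while the second one only follows the recorded transitions, so both end up pointing to the same hidden class.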

So far, though, we have only looked at the first three lines of code. Lines 6, 10, and 11 add more fields, but do so conditionally. Focusing only on the graph of hidden classes, fig. 2 shows what that would look like.

Fig. 2. Hidden class graph for our example program. For the case that there's no altitude but movement information, the hidden class graph has a branch since the altitude field is not present.

The first three hidden classes are the same as before. In the case that we have neither an altitude nor a movement, we simply stay in the third hidden class. However, if we have any of these additional details, the so-called hidden class graph would further evolve. In this particular case, we would even introduce a branch depending on whether we have the altitude details. In the simplest kind of hidden class graph, bearing and speed would then end up at different indexes in the object, depending on whether altitude was added first.
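The effect of the branch on field indexes can be shown with a small sketch. Here, I assume the simplest possible representation, where a hidden class is just the ordered list of field names and a field's index is its position in that list. The helper `addField` is hypothetical.

```javascript
// A layout is an ordered list of field names; adding a field yields a new layout.
function addField(layout, name) {
  return [...layout, name];
}

const base = addField(addField([], "longitude"), "latitude");

// Branch 1: altitude is available, then movement information is added.
const withAltitude = addField(base, "altitude");
const altitudeThenMoving = addField(addField(withAltitude, "bearing"), "speed");

// Branch 2: movement information without altitude.
const movingOnly = addField(addField(base, "bearing"), "speed");

console.log(altitudeThenMoving.indexOf("bearing")); // 3
console.log(movingOnly.indexOf("bearing"));         // 2
```

Code that reads `bearing` therefore has to deal with two different hidden classes and two different indexes, depending on which branch an object took.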

Optimizing Hidden Class Graphs

In his previous work on the eJSVM, Tomoharu already relied on JavaScript programs showing fairly stable patterns in their behavior. This means that the differences caused by user input and between different runs of a program are relatively minor when one observes a good sample of program executions. Thus, we came up with the idea of using a classic profile-guided optimization approach to optimize the hidden class graph of an application.

The basic idea is that during execution of a representative set of runs, we can use the garbage collector to gather statistics about the kind of objects used in a program, their hidden classes, and which of the hidden classes are used most. With this information, we can optimize the hidden class graph of a program for future executions.
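As a rough illustration of the profiling step, the following sketch counts how many live objects use each hidden class. The `heap` array stands in for the objects a garbage collector would visit, hidden classes are identified by a plain string, and the data is made up. None of this reflects how eJSVM actually stores its profile.

```javascript
// Count how many live objects use each hidden class.
function profileHiddenClasses(heap) {
  const counts = new Map();
  for (const obj of heap) {
    counts.set(obj.hiddenClass, (counts.get(obj.hiddenClass) ?? 0) + 1);
  }
  return counts;
}

// Sort hidden classes from most to least used.
function rankByUse(counts) {
  return [...counts.entries()].sort((x, y) => y[1] - x[1]);
}

// Made-up heap contents, for illustration only.
const heap = [
  { hiddenClass: "longitude,latitude" },
  { hiddenClass: "longitude,latitude,altitude" },
  { hiddenClass: "longitude,latitude" },
  { hiddenClass: "longitude,latitude" },
];

const ranking = rankByUse(profileHiddenClasses(heap));
console.log(ranking[0]); // the most-used hidden class and its object count
```

A ranking like this is the kind of information that tells us which hidden classes are worth optimizing for.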

Specifically, we apply the following optimizations:

move branches in the graph to reduce memory use for the most-used hidden classes
eliminate hidden classes that are only used temporarily before changing to another one
merge identical branches of the graph from unrelated allocation sites

Let’s go through these optimizations step by step. For our example, let’s assume we profiled a number of representative executions and found out that indeed altitude and movement are rarely available. Thus, most objects will only have longitude and latitude fields. Furthermore, our data also tells us that it basically never happens that we have movement information without also having the altitude.

This means we can apply our first optimization and “move” the branch for adding altitude in the hidden class graph. Figure 3 highlights the change in red. By moving the branch for adding altitude to the hidden class after adding bearing and speed, we can merge the two branches since they are now identical. This means the dotted hidden classes in the figure can be dropped. In the paper, we go into a bit more detail on how to make this correct without changing JavaScript semantics.

Fig. 3. By moving the rarely used branch without altitude information, the two branches become identical, and we can drop the dotted elements from the graph.

As the second optimization, we eliminate “temporary” hidden classes, which are rarely used over longer time spans. The prime examples of these temporary hidden classes are the ones that are only used between adding fields one after another. As highlighted with dotted lines in fig. 4, the hidden class between adding longitude and latitude, as well as the ones between adding bearing, speed, and finally altitude, can be removed. In the end, this leaves only three hidden classes in our graph.

Fig. 4. By removing intermediate classes which are only in use very briefly, the hidden class graph can be shrunk to just three hidden classes.
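One way to picture the effect of dropping intermediate hidden classes is to allocate an object directly with the layout we expect it to end up with. The sketch below is only an approximation under that assumption: `allocate`, `setField`, and `readSensor` are hypothetical helpers, and it ignores the question of how a VM distinguishes a field that has not been added yet from one that holds `undefined`.

```javascript
// Allocate an object whose layout (ordered field names) is known up front.
function allocate(layout) {
  return {
    layout,                                            // shared, never mutated
    values: new Array(layout.length).fill(undefined),  // one slot per known field
    transitions: 0,                                    // layout changes so far
  };
}

// Store a field, falling back to a layout transition for unexpected fields.
function setField(obj, name, value) {
  let index = obj.layout.indexOf(name);
  if (index === -1) {
    obj.layout = [...obj.layout, name]; // new layout, like a new hidden class
    obj.values.push(undefined);         // grow the value array by one slot
    obj.transitions += 1;
    index = obj.layout.length - 1;
  }
  obj.values[index] = value;
}

function readSensor(layout) {
  const location = allocate(layout);
  setField(location, "longitude", 51.28);
  setField(location, "latitude", 1.08);
  return location;
}

const unoptimized = readSensor([]);                      // starts with an empty layout
const optimized = readSensor(["longitude", "latitude"]); // layout taken from a profile

console.log(unoptimized.transitions); // 2
console.log(optimized.transitions);   // 0
```

With the predicted layout, the object never passes through the intermediate layout that only contains longitude, and its value array does not need to grow.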

The third optimization is relevant for larger programs and not directly visible in our example. However, often different code paths would create the same kind of objects. Thus, the hidden classes would be basically the same, which allows us to merge these identical parts of the hidden class graph.
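A simple way to think about this merging is as canonicalization: layouts with the same fields in the same order are replaced by one shared instance. The following sketch assumes layouts are ordered lists of field names and that field names contain no commas. The function `mergeIdenticalLayouts` and the two allocation sites are invented for the example.

```javascript
// Replace structurally identical layouts by a single shared instance.
function mergeIdenticalLayouts(sites) {
  const canonical = new Map(); // "field,field,..." -> shared layout
  const merged = sites.map((layouts) =>
    layouts.map((layout) => {
      const key = layout.join(",");
      if (!canonical.has(key)) {
        canonical.set(key, layout);
      }
      return canonical.get(key);
    })
  );
  return { merged, distinct: canonical.size };
}

// Two unrelated allocation sites that build the same kind of object.
const sensorSite = [["longitude"], ["longitude", "latitude"]];
const replaySite = [["longitude"], ["longitude", "latitude"]];

const { merged, distinct } = mergeIdenticalLayouts([sensorSite, replaySite]);
console.log(distinct);                      // layouts left after merging
console.log(merged[0][1] === merged[1][1]); // both sites share one layout
```

Before merging, the two sites hold four separate layouts. Afterwards, only two remain and both sites refer to the same ones.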

Results

For the evaluation, we used a variation of the Are We Fast Yet benchmarks. Here, I’ll just quickly look at the memory savings.

For these measurements, we first collected the profiling information for each benchmark, optimized the hidden class graph, and then used the result to guide the execution.

As can be seen in fig. 5, the optimized hidden class graphs are quite a bit smaller. We reduce their memory use by about 62% on average.

Fig. 5. Memory used by hidden classes and related data structures. Overall, this metadata uses about 62% less memory on average with our optimizations.

By reducing the overall number of hidden classes in the system, we also noticed a few speedups in our benchmarks. Figure 6 shows that the reduction in hidden classes reduces the cache misses for eJSVM’s single-entry lookup caches, especially for larger benchmarks such as CD and Havlak. However, there are other effects. For instance, with the hidden classes known up front, we can size objects correctly on allocation, avoid the extra array to store the fields, and reduce the number of times the array needs to be expanded because of frequent transitions.

Fig. 6. Reduction in lookup/inline cache misses. eJSVM has single-entry lookup caches, which means the overall reduction in hidden classes can improve performance.
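To see why fewer hidden classes help a single-entry lookup cache, consider the following sketch. It is a simplified model and not eJSVM's implementation: `LookupCache`, `countMisses`, and the two hidden classes are assumptions made for this example.

```javascript
// A single-entry cache remembers one hidden class and the field index for it.
class LookupCache {
  constructor() {
    this.cachedClass = null;
    this.cachedIndex = -1;
    this.misses = 0;
  }

  read(obj, name) {
    if (obj.hiddenClass !== this.cachedClass) {
      // Miss: do the slow lookup and overwrite the single cache entry.
      this.misses += 1;
      this.cachedClass = obj.hiddenClass;
      this.cachedIndex = obj.hiddenClass.fields.indexOf(name);
    }
    return obj.values[this.cachedIndex];
  }
}

// Read `longitude` from all objects at one access site and count the misses.
function countMisses(objects) {
  const cache = new LookupCache();
  for (const obj of objects) {
    cache.read(obj, "longitude");
  }
  return cache.misses;
}

const classA = { fields: ["longitude", "latitude"] };
const classB = { fields: ["longitude", "latitude", "altitude"] };

// Objects alternate between two hidden classes.
const mixed = [
  { hiddenClass: classA, values: [51.28, 1.08] },
  { hiddenClass: classB, values: [51.28, 1.08, 12] },
  { hiddenClass: classA, values: [51.29, 1.09] },
  { hiddenClass: classB, values: [51.3, 1.1, 14] },
];

// The same objects if they all shared one hidden class.
const uniform = mixed.map((obj) => ({
  hiddenClass: classB,
  values: [obj.values[0], obj.values[1], obj.values[2]],
}));

console.log(countMisses(mixed));   // 4
console.log(countMisses(uniform)); // 1
```

When objects with different hidden classes alternate at the same access site, every read evicts the previous entry. Once they share a hidden class, only the very first read misses.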

Profile Guided Offline Optimization of Hidden Class Graphs for JavaScript VMs in Embedded Systems

In the paper, we discuss a lot more details, background, and evaluation. Of course, there are various corner cases to be considered, for example things like JavaScript’s prototype objects, how to make sure that the JavaScript semantics don’t break even though the hidden classes change, and a few other bits.

Please give the paper a read, attend our presentation at VMIL’22, and find some of us for questions, comments, and suggestions on Twitter @profrejones and @smarr.

Abstract

JavaScript is increasingly used for the Internet of Things (IoT) on embedded systems. However, JavaScript’s memory footprint is a challenge, because normal JavaScript virtual machines (VMs) do not fit into the small memory of IoT devices. In part this is because a significant amount of memory is used by hidden classes, which are used to represent JavaScript’s dynamic objects efficiently.

In this research, we optimize the hidden class graph to minimize their memory use. Our solution collects the hidden class graph and related information for an application in a profiling run, and optimizes the graph offline. We reduce the number of hidden classes by avoiding introducing intermediate ones, for instance when properties are added one after another. Our optimizations allow the VM to assign the most likely final hidden class to an object at its creation. They also minimize re-allocation of storage for property values, and reduce the polymorphism of inline caches.

We implemented these optimizations in a JavaScript VM, eJSVM, and found that offline optimization can eliminate 61.9% of the hidden classes on average. It also improves execution speed by minimizing the number of hidden class transitions for an object and reducing inline cache misses.

Profile Guided Offline Optimization of Hidden Class Graphs for JavaScript VMs in Embedded Systems
T. Ugawa, S. Marr, R. Jones; In Proceedings of the 14th ACM SIGPLAN International Workshop on Virtual Machines and Intermediate Languages, VMIL'22, p. 11, ACM, 2022.
Paper: PDF
DOI: 10.1145/3563838.3567678

BibTex: bibtex

@inproceedings{Ugawa:2022:HCGOpt,
  abstract = {JavaScript is increasingly used for the Internet of Things (IoT) on embedded systems. However, JavaScript's memory footprint is a challenge, because normal JavaScript virtual machines (VMs) do not fit into the small memory of IoT devices. In part this is because a significant amount of memory is used by hidden classes, which are used to represent JavaScript's dynamic objects efficiently.
    
  In this research, we optimize the hidden class graph to minimize their memory use. Our solution collects the hidden class graph and related information for an application in a profiling run, and optimizes the graph offline. We reduce the number of hidden classes by avoiding introducing intermediate ones, for instance when properties are added one after another. Our optimizations allow the VM to assign the most likely final hidden class to an object at its creation. They also minimize re-allocation of storage for property values, and reduce the polymorphism of inline caches.
    
  We implemented these optimizations in a JavaScript VM, eJSVM, and found that offline optimization can eliminate 61.9% of the hidden classes on average. It also improves execution speed by minimizing the number of hidden class transitions for an object and reducing inline cache misses.},
  author = {Ugawa, Tomoharu and Marr, Stefan and Jones, Richard},
  booktitle = {Proceedings of the 14th ACM SIGPLAN International Workshop on Virtual Machines and Intermediate Languages},
  day = {5},
  doi = {10.1145/3563838.3567678},
  keywords = {EmbeddedSystems HiddenClasses InlineCaching IoT JavaScript MeMyPublication OfflineOptimization VirtualMachine myown},
  location = {Auckland, New Zealand},
  month = dec,
  pages = {11},
  pdf = {https://stefan-marr.de/downloads/vmil22-ugawa-et-al-profile-guided-offline-optimization-of-hidden-class-graphs.pdf},
  publisher = {ACM},
  series = {VMIL'22},
  title = {Profile Guided Offline Optimization of Hidden Class Graphs for JavaScript VMs in Embedded Systems},
  year = {2022},
  month_numeric = {12}
}

Acknowledgments

Thanks to Tomoharu and Richard for suggestions and corrections on this blog post.