Irrationally Annoyed: The SIGPLAN Blog Post Writing 30 Years of PL Research Out of Existence

I started writing this post while being very, very annoyed by this blog post on the SIGPLAN blog. I could not understand how “THE SIGPLAN” blog could simply write 30 years of programming language research out of existence, only barely acknowledging Self and JavaScript. It felt like duty called.

Though, after multiple rounds of edits, I am now assuming that there’s simply an issue of awareness. What follows is therefore an attempt to raise awareness of the various academic and industrial efforts around the performance of dynamic languages.

Disclaimer: this blog post isn’t attempting to be complete in any way, but I hope it provides enough pointers for people interested in language implementation technology for dynamic languages to dig deeper and possibly even contribute.

Languages vs. Language Implementations

Before going into technical details, it is important to start by distinguishing between a language and its implementation. Especially when we talk about programming language research, the two are rarely the same.

While the languages mentioned in the post (Python, R, PHP) all have “standard” implementations, which are by far the most widely used ones, they are far from the only implementations. Let’s just pick three for each language: there are PyPy, IronPython, and Jython for Python; pqR, Renjin, and FastR for R; as well as HippyVM, PeachPie, and Quercus for PHP.

They all have benefits and drawbacks, and perhaps they are not even 100% compatible, but various groups of people clearly spent a lot of time and energy on more optimized language implementations. Some of these implementations have state-of-the-art just-in-time compilers and can reach performance competitive with JavaScript, which itself is certainly competitive with, for instance, Java, and only slightly behind “metal” languages on some benchmarks, and ahead on others.

How do I know? I didn’t benchmark all of these language implementations, but I worked on comparing compilers across languages with the Are We Fast Yet project [5]. One important insight here is that language design can definitely have a performance impact. But the ingenuity of compiler writers and language implementers has eliminated all major performance hurdles over the last 30 years.

This means that claims about the languages themselves, for instance that they are “all notoriously slow and bloated” and “not designed to be fast or space-efficient”, are rather misguided.

Just-in-time compilation technology as used for JavaScript, Java, Python, and others is very effective, and not particularly language-specific. Some of the key techniques have been known for 30 years, as I discussed elsewhere. Others, such as storage strategies, mementos, and partial escape analysis, are much more recent and help with many issues. Perhaps interesting to note is that storage strategies were actually first done in PyPy, and are now pretty much standard in just-in-time compiling VMs for a whole range of languages, including JavaScript.
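
To make the idea of storage strategies a bit more concrete, here is a minimal sketch in Python. It only illustrates the concept, under the assumption that a compact, unboxed representation is kept as long as a collection stays homogeneous; it is not how PyPy or any real VM implements it, and all names are made up.

# Minimal sketch of the storage-strategy idea (not PyPy's actual implementation):
# a list keeps its elements unboxed while they are all integers, and only
# falls back to a generic boxed representation when that assumption breaks.
from array import array

class StrategyList:
    def __init__(self):
        self._ints = array('q')   # unboxed 64-bit integer storage
        self._generic = None      # boxed storage, used only after generalizing

    def append(self, value):
        if self._generic is None:
            if isinstance(value, int):
                self._ints.append(value)   # stay on the compact strategy
                return
            # first non-integer: switch strategies by boxing existing elements
            self._generic = list(self._ints)
        self._generic.append(value)

    def __getitem__(self, index):
        storage = self._generic if self._generic is not None else self._ints
        return storage[index]

xs = StrategyList()
for i in range(5):
    xs.append(i)        # stays unboxed
xs.append("done")       # generalizes to boxed storage
print(xs[2], xs[5])     # 2 done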

Though, one may argue that these techniques have not been successful enough, because people don’t use the language implementations that actually benefit from them.

Fine, but I believe this is only partially a technical problem. Some of it is a social issue, too.

Others have spent time identifying potential reasons why people do not pick up our research. For instance, Laurie wrote blog posts on why our VMs have not made more users happy [1, 2].

Given the claim that JavaScript is somehow exceptional, I would argue that JavaScript was lucky in a way. JavaScript was hilariously slow to start with. For V8, this meant that they didn’t have to worry about making “startup performance” worse. Their baseline native-code compiler produced code that was much better than the JavaScript interpreters back then. Perhaps unfortunately, Python has a surprisingly fast interpreter, which gives VMs such as PyPy quite a challenge. Still, it is not impossible to beat such interpreters and have just-in-time compiling VMs that also have good startup performance. Examples are indeed modern JavaScript VMs, but also much smaller and less well-funded projects such as LuaJIT, which has an incredibly fast interpreter and a very effective just-in-time compiler. But indeed, it takes time and effort to build these systems, and contrary to the claims in the blog post, that effort is indeed being invested in these languages.

Is Python doomed? Of course not!

So, what about Python then? The blog post picks up a claim that Python is incredibly slow when it comes to matrix multiplication. And later, without much context, it claims that PyPy is perhaps 2 times faster than Python, but sometimes slower.

One could have asked the PyPy community, or looked at their blog. PyPy doesn’t have auto-vectorization as far as I know, so, indeed, it has a hard time reaching the performance of vectorized code, but it is much faster than what’s implied. Such broad claims are not just unjustified and bad style, they are also painfully unscientific. No, Python is not necessarily slow and bloated. Maps, hidden classes, and object shapes make it possible to store even integers rather efficiently. With the previously mentioned storage strategies, this works extremely well for your huge array of integers, too.
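
For readers unfamiliar with maps/hidden classes/object shapes, here is a minimal, self-contained sketch of the idea in Python. The class and field names are invented for illustration; real VMs implement this at a much lower level, but the core trick is the same: objects with the same set of fields share one shape.

# Minimal sketch of maps/hidden classes ("object shapes"): objects with the same
# fields share one shape, which maps field names to slot indices, so the
# per-object storage is just a flat array of values.
class Shape:
    def __init__(self, fields=()):
        self.fields = dict(fields)          # name -> slot index
        self._transitions = {}              # name -> successor shape (cached)

    def with_field(self, name):
        succ = self._transitions.get(name)
        if succ is None:
            succ = Shape(list(self.fields.items()) + [(name, len(self.fields))])
            self._transitions[name] = succ  # shared by all objects taking this path
        return succ

EMPTY_SHAPE = Shape()

class Obj:
    def __init__(self):
        self.shape = EMPTY_SHAPE
        self.slots = []

    def set(self, name, value):
        index = self.shape.fields.get(name)
        if index is None:                   # new field: transition to the next shape
            self.shape = self.shape.with_field(name)
            self.slots.append(value)
        else:
            self.slots[index] = value

    def get(self, name):
        return self.slots[self.shape.fields[name]]

a, b = Obj(), Obj()
a.set('x', 1); a.set('y', 2)
b.set('x', 3); b.set('y', 4)
print(a.get('y'), b.get('x'))        # 2 3
print(a.shape is b.shape)            # True: both objects share the same shape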

Now, perhaps more important is the question of whether auto-vectorization is something dynamic languages simply can’t support. Again, there’s no fundamental reason why VMs for dynamic languages can’t support it. GraalVM supports auto-vectorization and can speed up JavaScript, Python, Ruby, and other dynamic languages with it. Key techniques are speculative compilation, type feedback, object shapes, and storage strategies. With them, optimistic compilers can vectorize even the code of dynamic languages. For good measure, I should probably also mention Juan Fumero’s work on using just-in-time compilation for dynamic languages to target GPUs.

Are Dynamic Languages Single-Threaded?

This one, I do indeed take personally.

First of all, being effectively single-threaded can be a Good Thing™. Why would you want to deal with shared-memory race conditions? This is a language design question, so in the end a matter of taste. And, if one really wants to, some forms of shared memory are possible, even without introducing all its drawbacks.

That aside, Python, R, and others have fairly nice ways of supporting parallelism, including multiprocess approaches. And, to blow my own horn a little more: if you really want shared-memory multithreading, you can have that efficiently, too, with our work on thread-safe object models [3] and thread-safe storage strategies [4].

Where to go from here?

To sum up: yes, metal languages will continue to be important, but so will the irrational exuberance languages. We as a community should lead the way in developing systems (at the hardware and software levels) that will make them run faster.

Yes, we should. And the PL community spent a lot of effort doing this for the last 30 years. So, please join us. We are out there and could use help!

MPLR has a keynote coming up titled “Hardware Support for Managed Languages: An Old Idea Whose Time Has Finally Come?”. Perhaps worth registering for. Attendance is free.

And of course, SPLASH is around the corner, too. With DLS and VMIL, SPLASH has a strong history of providing venues for discussing the implementation of dynamic languages. OOPSLA, as in the past, has pretty cool papers on dynamic language implementation.

Other relevant venues include for instance ICOOOLPS and MoreVMs.

My Introduction to Efficient and Safe Implementations of Dynamic Languages has many pointers to research, some of which also made it into an early overview of the field.

For questions or differences of opinion, find me on Twitter @smarr.

Post Scriptum

Part of my annoyance with the blog post is its dismissive and condescending tone.

Some of these so-called “scripting languages” are essentially moribund, like Perl (1987)

This seems a completely unnecessary point to make. It is also unclear how this is assessed. The Perl community is active, but yes, a lot of effort went into Raku.

Elsewhere, it states that “JavaScript is an outlier”, refers to the “monoculture [of] (the browser)”, and says there’s “no need for compatibility with legacy code”. JavaScript is certainly an interesting special case, and indeed, as acknowledged, browsers just happen to have attracted a lot of funding. But a monoculture? In which sense? At least in the past, “the browser” itself was rather diverse. And when it comes to backwards compatibility, the number one design goal of Ecma TC39, the group designing JavaScript, is “don’t break the web”.

And of course, browsers themselves are incredibly complex “legacy systems”. To get a taste, I’d recommend listening to Michael Lippautz talking about how C++ and JavaScript interact.

Post Post Scriptum

Python programmers need to distinguish between time and memory spent in pure Python (optimizable) from time and memory spent in C libraries (not so much). They need help tracking down expensive and insidious traffic across the language boundaries (copying and serialization).

We could also try to get rid of that boundary instead. Or give more power to people like Stephen Kell, and avoid copying/serialization in a different way.

Metaprogramming, Metaobject Protocols, Gradual Type Checks: Optimizing the "Unoptimizable" Using Old Ideas

Last year, I was asked to give a talk at the Meta’19 workshop. It’s a workshop on metaprogramming and reflection. The submission deadline for this year’s edition is less than a month away: check it out!

With my interest in making run-time metaprogramming fast, I thought it might be worthwhile to explore how techniques from the early days of just-in-time compiling virtual machines are still the key elements of optimizing modern language features.

Very naively, I looked at the relevant papers and when they were published, and picked some of the top-10 movies of the year. I thought, somehow, there should be a story line. I wrote the abstract (below) and sent it to the organizers. They seemed to like it. Good.

A few months later, a few weeks before the actual workshop, I found myself looking at the abstract: damn, what was I thinking?

After bending and stretching the metaphors well beyond the breaking point, I ended up with a talk that may or may not make sense. Ignoring the poor metaphors, it sketches how polymorphic inline caches (WP) and maps (also known as object shapes or hidden classes) can be combined to optimize run-time reflection, metaobject protocols, and gradual typing.

The interesting insight here is that a map/hidden class can be used to encode a wide range of properties about a group of objects, which enables compilers to optimize and generate efficient code.
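
To give a flavor of the combination, here is a small, self-contained Python sketch of a polymorphic inline cache keyed on an object’s shape. The CallSite class and its names are made up for illustration, and a plain Python class stands in for the map/hidden class a VM would use; it is a sketch of the idea, not of any VM’s actual machinery.

# Minimal sketch: a polymorphic inline cache keyed on an object's shape caches
# both the reflective method lookup and the outcome of a (gradual) type check,
# so both are paid only once per shape at a given call site.
class CallSite:
    def __init__(self, selector, expected_type):
        self.selector = selector
        self.expected_type = expected_type
        self.cache = {}                       # shape -> looked-up method

    def call(self, receiver, *args):
        shape = type(receiver)                # stand-in for the object's shape/map
        method = self.cache.get(shape)
        if method is None:                    # slow path: reflection + type check
            if not isinstance(receiver, self.expected_type):
                raise TypeError(f"{shape.__name__} fails the type check")
            method = getattr(shape, self.selector)
            self.cache[shape] = method        # subsequent calls hit the cache
        return method(receiver, *args)

class Point:
    def __init__(self, x, y): self.x, self.y = x, y
    def magnitude(self): return (self.x ** 2 + self.y ** 2) ** 0.5

site = CallSite('magnitude', Point)
print(site.call(Point(3, 4)))   # 5.0, slow path: lookup + check, then cached
print(site.call(Point(6, 8)))   # 10.0, fast path: cache hit on the Point shape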

Abstract

Metaobject Protocols and Type Checks, do they have much in common? Perhaps not from a language perspective. However, under the hood of a modern virtual machine, they turn out to show very similar behavior and can be optimized very similarly.

This talk will go back to the days of Terminator 2, The Naked Gun 2 1/2, and Star Trek VI. We will revisit the early days of just-in-time compilation, the basic insights that are still true, and see how to apply them to metaprogramming techniques of different shapes and forms.

If you have any questions, I am more than happy to answer, possibly on Twitter @smarr.

Is This Noise, or Does This Mean Something? #benchmarking

Do my performance measurements allow me to conclude anything at all?

For the work on various research languages (SOMns, SOM, TruffleSOM …), I have been using continuous performance tracking for the better part of the last decade.

One would hope that I would know what I am doing…

For about half a year now, I have been using ReBenchDB to track performance (before that, I used Codespeed), and I use the same statistical tools that I use for academic research papers. While ReBenchDB is of great help, and built for exactly what I want to do, it made me question more strongly than before whether I can even trust these benchmark results.

Screenshot of ReBenchDB showing the results of an experiment

The screenshot shows a table with the results of an experiment for which I would like to know whether it changes the performance of my language implementation. I picked a section from the full report showing microbenchmarks, since I would expect them to be the simplest to explain, and for this particular experiment, I wouldn’t expect any changes.

On first inspection, however, the results seem mixed. We got two benchmarks marked as red and three as green, which may indicate slowdowns and speedups. Though, a closer look at the boxplots indicates that we can’t really draw any strong conclusions. The boxes pretty much overlap, and my experience tells me that this may as well just be noise. I have seen similar ups and downs rather frequently on these benchmarks.

This means, I can’t tell with confidence whether the change I tried to benchmark has any kind of positive or negative impact on performance. Though, I am somewhat sure that nothing went terribly wrong. I guess that’s at least something.

One option may be to let the benchmarks run for longer. Currently, it takes about 40min for the benchmark setup to run. If the reliability of the results could be increased, perhaps running the benchmarks for an hour may still be good enough. Much longer run times will just mean that benchmarking will be ignored, so there’s really an upper limit on how long things can take. Another option might be adding more benchmarking machines, though currently there’s no budget for that either.

So, what would it take to improve the reliability of my benchmarks and get numbers that I can trust more readily?

Gathering More Data to Understand What’s Going On

The benchmarking setup I am using relies on ReBench. This means, in the ideal case, I can simply ask ReBench to execute every benchmark for 3000 iterations and repeat this 30 times. Since I know that compilation triggers after 1000 iterations, this should give us fully warmed-up execution. Having everything executed 30 times, i.e., using 30 invocations, should make sure that any variation across different invocations can also be determined reliably.

This should be all we need to change about the setup:

rebench --invocation=30 --iteration=3000 rebench.conf

About 7.5 days later, I got the data I wanted.

Not something that’s practical for everyday development…

Get a Bit of an Overview of All the Data

Having the data, let’s try to get an impression of the overall situation. A line chart that shows the iterations on the x-axis and the run time of an iteration in milliseconds on the y-axis should do the job.

Though, we got 30 invocations for every benchmark. To see how these invocations differ, I’ll include all of them but make the lines transparent. This should allow us to see differences in behavior.
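
For those who want to reproduce this kind of plot, here is a rough sketch of how it can be done. It assumes the measurements were exported to a CSV file with columns benchmark, invocation, iteration, and value (run time in ms); the file name and column names are assumptions for illustration, not ReBenchDB’s actual export format.

# Sketch: one semi-transparent line per invocation, so overlapping behavior
# shows up darker. Assumed data layout: benchmark, invocation, iteration, value.
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('measurements.csv')
bench = data[data.benchmark == 'IntegerLoop']

fig, ax = plt.subplots()
for _, inv in bench.groupby('invocation'):
    ax.plot(inv.iteration, inv.value, color='steelblue', alpha=0.2, linewidth=0.8)

ax.set_xlabel('iteration')
ax.set_ylabel('run time (ms)')
ax.set_title('IntegerLoop: 30 invocations, 3000 iterations each')
fig.savefig('integer-loop-iterations.png', dpi=150)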

The first plot below shows the IntegerLoop benchmark:

public benchmark = ( | bounds a |
  bounds:: 20000.
  bounds negated to: bounds by: 1 do: [:value | a:: value - value].
  ^ a
)

The above code is a loop from -20000 to 20000 that sets the variable a to the result of subtracting the current loop index from itself. It’s a classic microbenchmark and I wouldn’t expect too many complications from it.

Run time of iterations for IntegerLoop benchmark on SOMns

Though, the plot shows a few interesting bits. First, note that there is a dark blue line coming down and then settling at about 25ms. At iteration 1000, we can see a clear spike. At this point, the benchmark harness itself crossed the compilation threshold and got compiled. Before that, only our benchmark method from above was compiled, since it was executed in an additional loop in the harness. Unfortunately, by compiling the harness as well, performance seems to decrease a bit to about 30ms, i.e., a 20% increase in run time. Since compilers rely on heuristics, these things can happen. Though, it would be good to look into why it happens here.

But back to the plot itself. We also see spikes in different shades of blue and of different sizes. These spikes likely correspond to garbage collection or compilation events. Though, overall, the result is pretty stable. Except for the undesirable slowdown, we would hope that all benchmarks look like this, since all thirty invocations would report the same result.

The next plot below shows the Vacation benchmark, which is a simulation of a travel booking system that has been used, for instance, to benchmark software transactional memory systems. The version used here relies on threads and locks, even though we run it with only a single thread.

Run time of iterations for Vacation benchmark on SOMns

All Iteration Plots

[Box: per-iteration run-time plots for every benchmark in the macro-steady, micro-steady, and savina-jit suites on SOMns (Graal).]

The plot is clearly less nice than the one for IntegerLoop. Only after about 100 iterations do we reach a point where the measurements fit onto the plot. This shows that the compiler can optimize the benchmark very well, but it takes time to do so. Eventually, the benchmark reaches a run time of under 2ms, which isn’t ideal, since we might simply be measuring noise at this level of performance.

We also see that there is a much larger area in different shades of blue. Though, there seems to be a blue core, which indicates that there is a somewhat normal-ish distribution of iteration times across our 30 invocations of the benchmark.

Just to emphasize what is visualized, I picked out one invocation and, instead of plotting it in a transparent blue, plotted it in red. With it, we can see the ups and downs within a single invocation, but the overall trend seems to stabilize.

The plots for all benchmarks are shown in the box on the right.

How Close Can We Get to the “True” Average Performance?

After inspecting our plots, and considering the compilation threshold of 1000 iterations, it seems plausible to define the “true” average for our benchmarks based on the measurements after the 2000th iteration. In the previous plots, you may have noticed the red dashed line at the 2000th iteration. This definition gives us 1000 iterations to calculate the average from. As a somewhat stable average, I’ll use the median for now. Let’s assume for simplicity that this is fine.

Since the time benchmarking takes needs to be practical, I need to minimize this warmup threshold as well as the number of iterations used to calculate the median, while keeping it within an acceptable range of the “true” median I calculated from the last 1000 iterations.

To get a feeling for whether this is workable, I’ll visualize how far off the median is when I take various warmup thresholds and numbers of iterations to calculate the average.
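
The underlying computation is straightforward; here is a sketch, using the same assumed CSV layout as above, that computes the percentage difference from the “true” median for a grid of warmup thresholds and iteration counts. The grid step sizes are arbitrary choices for illustration.

# Sketch: for each (warmup, n) pair, take the median of the n iterations after
# the warmup cutoff and compare it against the "true" median of iterations > 2000.
import pandas as pd

data = pd.read_csv('measurements.csv')
bench = data[data.benchmark == 'IntegerLoop']

true_median = bench[bench.iteration > 2000].value.median()

rows = []
for warmup in range(0, 2001, 50):
    after_warmup = bench[bench.iteration > warmup]
    for n in range(10, 1001, 10):
        used = after_warmup[after_warmup.iteration <= warmup + n]
        diff = (used.value.median() - true_median) / true_median * 100
        rows.append((warmup, n, diff))

grid = pd.DataFrame(rows, columns=['warmup', 'iterations', 'diff_percent'])
print(grid[grid.diff_percent.abs() <= 0.1].head())   # configurations close to the "true" median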

The plot below depicts the difference from the “true” median in percent. A clear blue means that the median calculated after discarding the warmup iterations (y-axis) and using only the first n iterations after that (x-axis) is identical to the “true” median. A clear bright red means the median differs by 5%. If no color is given, the median differs by more than 5%.

Difference from "true" median in % for IntegerLoop

For our IntegerLoop benchmark, we can easily match this figure to the previous one. Because of how the compiler decided to optimize, we initially have a performance that differs by about 20% from the “true” median and thus is left white. Around the 1000th iteration, discarding the iterations before it as warmup and using only perhaps 100 iterations for the median, we see some yellow, which means the median is off by 2.5%. Below 1000 iterations of warmup, but with more than 100 or so iterations, we see a darker shade of blue, indicating perhaps a 1% difference from the “true” median. The rest is clear blue, indicating that whichever configuration of warmup and iterations we choose would match the “true” median.

Difference from "true" median in % for Vacation

All Warmup/Iteration Plots

[Box: warmup/iteration difference plots for every benchmark in the macro-steady, micro-steady, and savina-jit suites on SOMns (Graal).]

For the Vacation benchmark, the plot is more interesting. For the most part, we are far off the “true” median. Though, with an increasing number of warmup iterations and iterations used for the median, we get closer to it. This establishes a diagonal where the values become gradually more accurate.

Looking at these plots for all benchmarks in the box on the right, we see that for quite a number of them, things look pretty good, and they show a lot of blue. Though, there are also a number where we clearly see that the compilation threshold is an issue, and we only get close to the “true” median after we cross the threshold. For other benchmarks, there are large parts that do not come close to the “true” median, as for the Vacation benchmark.

Is There A More Stable and Informative Metric?

While the above plots give us some insight into how much warmup and how many iterations are needed to get close to the “true” median, the precision leaves a bit to be desired. To keep these plots readable, I also avoided adding other colors to distinguish between values below and above the “true” median.

Originally, I wanted to use bootstrapping and bootstrap confidence intervals here, which would avoid making unsupported assumptions about the distribution of measurements. Unfortunately, it turned out to be impractical with the large data set at hand. Instead, I’ll look at a few quantiles, the arithmetic mean, and the minimum. The minimum has been shown before to be a metric that stabilizes fast, though I assume that it hides possibly important changes in practice.
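
As a concrete illustration of what is being compared, here is a small sketch, again on the assumed CSV layout from above, that computes these metrics for one arbitrarily chosen warmup/iteration pair and reports how far each is from its “true” counterpart. The chosen pair and the exact quantiles are assumptions for illustration.

# Sketch: compute a few quantiles, the mean, and the minimum of the selected
# measurements, and compare each against its "true" value from iterations > 2000.
import pandas as pd

def metrics(values):
    return {
        'min':    values.min(),
        'q25':    values.quantile(0.25),
        'median': values.median(),
        'mean':   values.mean(),
        'q75':    values.quantile(0.75),
    }

data = pd.read_csv('measurements.csv')
bench = data[data.benchmark == 'Vacation']

true_vals = metrics(bench[bench.iteration > 2000].value)

warmup, n = 500, 300                      # one assumed (warmup, iterations) pair
used = bench[(bench.iteration > warmup) & (bench.iteration <= warmup + n)]
observed = metrics(used.value)

for name in true_vals:
    diff = (observed[name] - true_vals[name]) / true_vals[name] * 100
    print(f'{name:>6}: {diff:+.2f}% from "true" value')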

To investigate these metrics, I will look at the pairs of warmup and number of iterations that fall roughly on the diagonal of the plots in the last section, indicated by the crosses you may have noticed.

As before, the first plot shows the results for IntegerLoop. The plot focuses on the range of 5% above and 5% below the “true” value assumed for each metric.

Difference from "true" value in % for IntegerLoop using data from all 30 invocations

Because of the difference between the later performance and the first 1000 iterations, we see the metrics converging to the “true” value only after the 1000th iteration. Interestingly, the median starts converging first, while the minimum comes in later. This happens here because we first need to discard enough warmup iterations before the minimum is what we expect it to be based on the last 1000 iterations.

All Difference Plots

[Box: metric difference plots for every benchmark in the macro-steady, micro-steady, and savina-jit suites on SOMns (Graal).]
Difference from "true" value in % for Vacation using data from all 30 invocations

For the Vacation benchmark, the plot again looks very different. As we would expect from the previous plots, the metrics converge very late. Though, here the minimum and the lower quantiles converge before the median and mean.

The other plots are again in a box on the right.

What’s the Largest Differnce from the “True” Value for these Metrics?

The above plots give us an idea of what it looks like when we use the data of all 30 invocations. Though, realistically, this takes too long for day-to-day engineering. Below, I am going to look at how the metrics behave when we always pick the invocation with the largest difference from the “true” value.
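
Computing this “worst invocation” view is a small variation on the sketches above, assuming the same CSV layout: the metric is computed per invocation for one assumed warmup/iteration pair, and the largest deviation from the “true” value is kept.

# Sketch: since a single benchmark run could be any one of the invocations,
# report the largest per-invocation deviation from the "true" median.
import pandas as pd

data = pd.read_csv('measurements.csv')
bench = data[data.benchmark == 'IntegerLoop']
true_median = bench[bench.iteration > 2000].value.median()

warmup, n = 500, 300                      # one assumed (warmup, iterations) pair
used = bench[(bench.iteration > warmup) & (bench.iteration <= warmup + n)]

per_invocation = used.groupby('invocation').value.median()
diffs = (per_invocation - true_median) / true_median * 100
print('max. difference of any invocation: %.2f%%' % diffs.abs().max())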

For the IntegerLoop benchmark, things look fairly familiar. Though, the mean does seem to get worse with more data. The median seems to be the best metric of the lot, but stays well above the indicated 0.1% error margin.

Max. difference of any invocation from "true" median in % for IntegerLoop

All Max. Difference Plots

[Box: max. difference plots for every benchmark in the macro-steady, micro-steady, and savina-jit suites on SOMns (Graal).]

For the Vacation benchmark, the situation is quite a bit worse. We are getting nowhere near the accuracy that we would need. The error reaches almost 25%.

Difference from "true" value in % for Vacation using data from all 30 invocations

Looking at the plots for all benchmarks in the box on the right doesn’t make me feel more joyous either. It’s bad.

How Many Benchmarks Reach a Certain Accuracy?

With these plots, the next question that comes to my mind is how many benchmarks actually reach a desired accuracy. So, for how many benchmarks is the difference from the “true” average within an allowed range?

The plot below indicates again: it doesn’t look good. The 25th percentile seems to behave reasonably well, and I’ll rely on it for the further discussion.

In total, there are 46 benchmarks. If we are optimistic and allow merely an error of 0.1%, there are only 5 benchmarks that reach the goal using the 25th percentile as the metric. With a 1% error, we get up to 12 benchmarks, and with 5%, even 27.

Though, this means there are a lot of benchmarks for which the accuracy is not very practical. Even with a 5% fluctuation, I can’t really tell whether a small optimization in the language implementation has an impact on performance.

The number of benchmarks that reach a given accuracy goal, i.e., for a metric, the benchmark's value is within the allowed difference from the "true" value.

How Much Data Do We Need to Reach a Certain Accuracy?

To conclude the discussion for now, let’s look at how much data we need to reach a desired level of accuracy. Using the 25th percentile as before, I’ll determine the number of iterations it takes (including warmup) before we reach a given degree of accuracy.

The table below shows the required iterations for a given accuracy. Benchmarks that aren’t listed simply don’t reach the accuracy. For the ones that are listed, we can see that a lot of data is needed. Of course, this still assumes we only use a single invocation for the benchmark. Thus, these numbers are derived from the maximum error between the “true” value and any one of the observed invocations.

Benchmark 0.1% 0.2% 0.3% 0.4% 0.5% 0.6% 0.7% 0.8% 0.9% 1% 1.1% 1.2% 1.3% 1.4% 1.5% 1.6% 1.7% 1.8% 1.9% 2% 2.1% 2.2% 2.3% 2.4% 2.5%
Dispatch 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352
FieldLoop 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352
Mandelbrot 1352 1352 577 577 577 577 577 552 552 552 552 552 552 502 502 352 352 352 352 352 352 352 352 352 352
ProducerConsumerBoundedBuffer 1052 502 477 277 277 277 277 277 227 227 177 77 77 77 77 77 77 77 77 77 77 52 52 52 52
Sum 777 777 777 777 777 777 777 777 302 302 302 302 302 302 302 227 227 227 227 227 202 202 202 202 202
NBody     427 427 427 427 427 427 427 427 427 427 277 277 277 277 277 277 277 277 277 277 252 252 252
TrapezoidalApproximation     127 127 127 127 127 52 52 52 52 52 52 52 52 52 52 52 52 52 52 52 52 52 52
IntegerLoop           1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352
ClosureFibonacci             2652 2602 2402 977 452 452 452 427 427 427 427 427 427 427 427 427 427 427 427
Lee                 1702 1652 1602 1602 1527 1527 1527 1502 1502 1427 1427 1402 1402 1402 1402 1402 1377
ConcurrentSortedLinkedList                   2952 2852 2802 2777 2627 2602 1702 1452 1402 1402 1377 1377 502 452 452 402
NQueens                   2702 2702 2702 2702 2702 2702 2702 2702 2702 2702 2702 2702 2702 2702 2702 2702
CD                         2727 2702 677 677 677 652 652 652 652 652 652 652 652
Recurse                           2502 127 127 127 127 127 127 127 127 102 102 102
Splay                                     3002 2877 2727 2352 2352 2327 2327
RichardsNS                                       877 502 502 502 502 502
WhileLoop                                       1452 1277 1277 852 852 852
Havlak                                             2952 2952 2202

Further Reading

This analysis isn’t done in a vacuum of course. Proper benchmarking and evaluation are an important topic in literature, especially when it comes to reporting results in academic papers. Georges at al. (2007) have long advocated for using statistically rigorous methods for evaluating just-in-time compiled systems. Their proposed methodology includes to take the mean per invocation for the iterations after a certain warmup cut off. Though, they use also confidence intervals and the coefficient of variance, which I haven’t done here. Interestingly they report a maximum difference of 20% between the minimum and maximum for benchmarks at peak performance. The benchmarks I am using on SOMns seem to show higher differences. Georges at al. also provided JavaStat, a tool to automatically determine the necessary number of benchmark iterations to reach a desired confidence interval width.

Kalibera and Jones (2013) expand on this work by investigating how to do rigorous benchmarking in reasonable time. They advocate for the use of effect-size confidence intervals as well as a visual approach to determine that measurements are independent of each other. They also look at the variation between invocations, which for at least one benchmark was 30%. This led them to propose a way to determine how many iterations and invocations one needs to achieve a desired precision. In an earlier paper, they went into the underlying technical details.

More recently, Barrett et al. looked into one of the underlying assumptions we typically make: we hope our systems reach a steady state. Spoiler: they often don’t. They use Krun to control the system on which the benchmarks are executed to minimize interference coming from the operating system and hardware. The work concludes with useful suggestions about sizing experiments, similar to what Kalibera and Jones did. They suggest using at the very least 10 invocations, where 15 should give reliable results.

Bulej et al. (2020) investigate how to benchmark in noisy environments such as multitenant cloud servers. They propose an approach that carefully synchronizes the execution of the pairs of systems to be compared. This seems to reduce the size of their bootstrap confidence intervals significantly.

What’s Next?

The aforementioned literature lays a foundation for statistically rigorous work. I have been adhering to at least some of the suggestions to ensure reliable results in my own papers. However, none of these approaches help me with my day-to-day engineering needs. Even doing 5 or 10 invocations for each benchmark results in overall run times of the benchmarking job that make it too slow for good usability.

From the data, it is evident that the compilation threshold is clearly observable. The first step will be to reduce it and hope that this leads to better results. A very quick and dirty run with the benchmarking settings I have been using didn’t indicate any significant improvements, but a full run is still in progress.

Another knob to turn is the benchmark parameters. Benchmarks such as Vacation have very short iteration times, which makes them susceptible to noise.

And then there are the strange things, for instance with IntegerLoop, which need to be investigated and possibly reveal engineering issues.

Since SOMns uses Truffle and Graal, using libgraal may further improve things. It would hopefully avoid interference from the compiler being compiled at run time. Similarly, perhaps there are a few simple things to be stolen from Krun that could reduce the environmental noise.

And there are probably a hundred other things I could try.

If you have suggestions, tips and tricks, comments, questions, or complaints, you can find me on Twitter @smarr.
