Don't Blindly Trust Your Java Profiler!

How do we know what to focus our attention on when trying to optimize the performance of a program? I suspect at least some of us will reach for sampling profilers. They keep the direct impact on program execution low by collecting stack traces only every so often, which gives us an approximate view of where a program spends its time. Though, as it turns out, this approximation can be surprisingly unreliable.
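To make the sampling idea concrete, below is a minimal, purely illustrative sketch in C of what a sampling profiler does at its core; none of the profilers discussed here works like this in detail, and the stack walk is reduced to a stub. The profiler wakes up at a fixed interval, records the method currently executing, and at the end reports per-method sample counts as percentages.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NUM_METHODS 3
#define NUM_SAMPLES 500
#define INTERVAL_US 10000  /* 10 ms between samples */

/* Stub standing in for the platform-specific stack walk a real profiler
 * performs; here it just picks a method at random. */
static int capture_top_frame(void) {
  return rand() % NUM_METHODS;
}

int main(void) {
  long counts[NUM_METHODS] = { 0 };

  for (int i = 0; i < NUM_SAMPLES; i++) {
    usleep(INTERVAL_US);             /* sleep until the next sampling point */
    counts[capture_top_frame()]++;   /* attribute this sample to a method */
  }

  for (int m = 0; m < NUM_METHODS; m++) {
    printf("method %d: %5.1f%% of samples\n",
           m, 100.0 * counts[m] / NUM_SAMPLES);
  }
  return 0;
}

Because every run draws a different, finite set of samples, the reported percentages naturally vary from run to run; the interesting question is by how much.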

Humphrey started his research work wanting to make profilers produce more directly actionable suggestions for where and how to optimize programs. Though, relatively quickly we noticed that sampling profilers are not only probabilistic, as one would expect, but can give widely different results between runs, results that do not necessarily converge even over many runs.

In 2010, Mytkowicz et al. identified safepoint bias as a key issue for sampling profilers for Java programs. Though, their results were not quite as bad as what we were seeing, so Humphrey started to design experiments to characterize the precision and accuracy of Java profilers in more detail.

How bad does it get?

Before getting started: we’re fully aware that this isn’t a new issue, and there are quite a number of great and fairly technical blog posts out there discussing a wide range of issues, for instance here, here, here, and here. In our work, we only look at fully deterministic and small pure Java benchmarks to get a better understanding of what the current situation is.

What’s the issue, you may ask? Well, let’s look at an example. Figure 1 shows the profiling results of Java Flight Recorder over 30 runs of the DeltaBlue benchmark. We see 8 different methods being identified as the hottest method, indicated by the hatched red bars.

Figure 1: Bar chart of the top 15 methods in the DeltaBlue benchmark as identified by Java Flight Recorder. A bar represents the average percentage of run time over 30 runs, and the error bars indicate the minimum and maximum values.

Of course, much of this could probably be explained by the non-determinism inherent to JVMs such as HotSpot: just-in-time compilation, parallel compilation, garbage collection, etc. However, we run each benchmark not only 30 times but also long enough for it to be fully compiled. So, we basically give the profiler and JVM a best-case scenario.¹ And what do we get as a result? No clear indication of where to start optimizing our application. Worse, had we looked at only a single profile, we might have started optimizing something that is rarely the bit of code the application spends most of its time on.

¹ At least to the degree that is practical. Benchmarking is hard, and there are many things going on in modern VMs. See also Tratt’s posts on the topic: 1, 2.

Fortunately, this is indeed the worst case we found.

Overall, we looked at async-profiler, Honest Profiler, Java Flight Recorder, JProfiler, perf, and YourKit. Figure 2 shows box plots for each of these profilers, indicating the range of differences we found between the minimum and maximum run-time percentage reported for each method over all benchmarks. The medians are not too bad, but each profiler shows cases where there is more than a 15% difference between some of the runs.

Figure 2: An aggregate of the differences between the minimum and maximum run-time percentage per method over all benchmarks.

The paper goes into much more detail analyzing the results by comparing profilers with themselves and among each other to be able to characterize accuracy and precision without knowing a ground truth. It also includes plots that show how the results are distributed for specific methods to identify possible sources of the observed variation.

So, for all the details, please see the paper linked below. Any pointers and suggestions are also greatly appreciated perhaps on Twitter @smarr or Mastodon.

Abstract

To identify optimisation opportunities, Java developers often use sampling profilers that attribute a percentage of run time to the methods of a program. Even so these profilers use sampling, are probabilistic in nature, and may suffer for instance from safepoint bias, they are normally considered to be relatively reliable. However, unreliable or inaccurate profiles may misdirect developers in their quest to resolve performance issues by not correctly identifying the program parts that would benefit most from optimisations.

With the wider adoption of profilers such as async-profiler and Honest Profiler, which are designed to avoid the safepoint bias, we wanted to investigate how precise and accurate Java sampling profilers are today. We investigate the precision, reliability, accuracy, and overhead of async-profiler, Honest Profiler, Java Flight Recorder, JProfiler, perf, and YourKit, which are all actively maintained. We assess them on the fully deterministic Are We Fast Yet benchmarks to have a stable foundation for the probabilistic profilers.

We find that profilers are relatively reliable over 30 runs and normally report the same hottest method. Unfortunately, this is not true for all benchmarks, which suggests their reliability may be application-specific. Different profilers also report different methods as hottest and cannot reliably agree on the set of top 5 hottest methods. On the positive side, the average run time overhead is in the range of 1% to 5.4% for the different profilers.

Future work should investigate how results can become more reliable, perhaps by reducing the observer effect of profilers by using optimisation decisions of unprofiled runs or by developing a principled approach of combining multiple profiles that explore different dynamic optimisations.

  • Don’t Trust Your Profiler: An Empirical Study on the Precision and Accuracy of Java Profilers
    H. Burchell, O. Larose, S. Kaleba, S. Marr; In Proceedings of the 20th ACM SIGPLAN International Conference on Managed Programming Languages and Runtimes, MPLR'23, p. 1–14, ACM, 2023.
  • Paper: PDF
  • DOI: 10.1145/3617651.3622985
  • Appendix: online appendix
  • BibTex: bibtex
    @inproceedings{Burchell:2023:Profilers,
      abstract = {To identify optimisation opportunities, Java developers often use sampling profilers that attribute a percentage of run time to the methods of a program. Even so these profilers use sampling, are probabilistic in nature, and may suffer for instance from safepoint bias, they are normally considered to be relatively reliable. However, unreliable or inaccurate profiles may misdirect developers in their quest to resolve performance issues by not correctly identifying the program parts that would benefit most from optimisations.
      
      With the wider adoption of profilers such as async-profiler and Honest Profiler, which are designed to avoid the safepoint bias, we wanted to investigate how precise and accurate Java sampling profilers are today. We investigate the precision, reliability, accuracy, and overhead of async-profiler, Honest Profiler, Java Flight Recorder, JProfiler, perf, and YourKit, which are all actively maintained. We assess them on the fully deterministic Are We Fast Yet benchmarks to have a stable foundation for the probabilistic profilers.
      
      We find that profilers are relatively reliable over 30 runs and normally report the same hottest method. Unfortunately, this is not true for all benchmarks, which suggests their reliability may be application-specific. Different profilers also report different methods as hottest and cannot reliably agree on the set of top 5 hottest methods. On the positive side, the average run time overhead is in the range of 1% to 5.4% for the different profilers.
      
      Future work should investigate how results can become more reliable, perhaps by reducing the observer effect of profilers by using optimisation decisions of unprofiled runs or by developing a principled approach of combining multiple profiles that explore different dynamic optimisations.},
      acceptancerate = {0.54},
      appendix = {https://github.com/HumphreyHCB/AWFY-Profilers},
      author = {Burchell, Humphrey and Larose, Octave and Kaleba, Sophie and Marr, Stefan},
      blog = {https://stefan-marr.de/2023/09/dont-blindly-trust-your-profiler/},
      booktitle = {Proceedings of the 20th ACM SIGPLAN International Conference on Managed Programming Languages and Runtimes},
      doi = {10.1145/3617651.3622985},
      keywords = {CPUSampling Comparison MeMyPublication Precision Profiling myown},
      month = oct,
      pages = {1--14},
      pdf = {https://stefan-marr.de/downloads/mplr23-burchell-et-al-dont-trust-your-profiler.pdf},
      publisher = {ACM},
      series = {MPLR'23},
      title = {{Don’t Trust Your Profiler: An Empirical Study on the Precision and Accuracy of Java Profilers}},
      year = {2023},
      month_numeric = {10}
    }
    

An Introduction to Interpreters and JIT Compilation

Last week, I gave two lectures at the Programming Language Implementation Summer School (PLISS). PLISS was very well organized, and the students and other presenters made for a very enjoyable week of new ideas, learning, and discussion.

For my own lectures, I decided to take an approach that focuses more on the high-level ideas and introduces a wider audience to how we build interpreters and to a range of techniques for just-in-time compilation.

Of course, I also wanted to talk a little bit about our own work. Thus, both lectures come with a strong bias towards meta-compilation systems. My interpreter lecture is informed by our upcoming OOPSLA paper, which shows that in the context of meta-compilation systems, abstract-syntax-tree interpreters do surprisingly well compared to bytecode interpreters.

My lecture on just-in-time compilation of course also went into how meta-compilation works and how it enables us to build languages that can reach state-of-the-art performance by compiling a user program through our interpreters. While it’s still a lot of work, the big vision is that one day, we might just define the grammar, provide a few extra details of how the language is to be executed, and then some kind of toolchain gives us a language runtime that executes user programs with state-of-the-art performance.

One can still dream… 🤓

When preparing these lectures, I was also looking back at the lectures I gave in 2019 for a summer school at Dagstuhl. Perhaps, this material will at some point form its own course on Virtual Machines. Another of those dreams…

Lectures

I have to admit, the original abstracts don’t quite represent the final lectures. So, I’ll also include the outlines in addition to the slides.

Interpreters: Everywhere And All The Time

Implementers often start with an interpreter to sketch how a language may work. They are easy to implement and great to experiment with. However, they are also an essential part of dynamic language implementations. We will talk about the basics of abstract syntax trees, bytecodes, and how these ideas can be used to implement a language. We will also look into optimizations for interpreters: how AST and bytecode interpreters can use run-time feedback to improve performance, and discuss how super nodes and super instructions allow us to make effective use of modern CPUs.
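To give a flavor of the first part of the lecture, here is a tiny abstract-syntax-tree interpreter sketched in C. It is only an illustration of the idea, not code from the lecture or from our systems: each node knows how to evaluate itself, and evaluating the root evaluates the whole expression.

#include <stdio.h>

typedef enum { LIT, ADD, MUL } NodeKind;

typedef struct Node {
  NodeKind kind;
  int value;                    /* used by LIT nodes */
  struct Node *left, *right;    /* used by ADD and MUL nodes */
} Node;

static int eval(const Node* n) {
  switch (n->kind) {
    case LIT: return n->value;
    case ADD: return eval(n->left) + eval(n->right);
    case MUL: return eval(n->left) * eval(n->right);
  }
  return 0;
}

int main(void) {
  /* the expression (2 + 3) * 4 as a tree */
  Node two   = { LIT, 2, NULL, NULL };
  Node three = { LIT, 3, NULL, NULL };
  Node four  = { LIT, 4, NULL, NULL };
  Node add   = { ADD, 0, &two, &three };
  Node mul   = { MUL, 0, &add, &four };
  printf("%d\n", eval(&mul));   /* prints 20 */
  return 0;
}

Real AST interpreters for dynamic languages build on this shape, but attach run-time feedback and self-optimizations to the nodes, which is what the lecture then focuses on.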

Outline
  • How are programming languages implemented?
  • Types of interpreters
    • abstract syntax tree
    • bytecode
  • Interpreter optimizations
    • Lookup caching
    • AST/bytecode-level inlining
    • Library lowering, library intrinsification
    • Super nodes, super instructions
    • Self-optimization, bytecode quickening
Slides

A Brief Introduction to Just-in-Time Compilation

Since the early days of object-oriented languages, run-time polymorphism has been a challenge for implementers. Smalltalk and Self pushed many ideas to an extreme; their implementers had to invent techniques such as lookup caches, tracing and method-based compilation, deoptimization, and maps. While these ideas originated in the ’80s and ’90s, they are key ingredients of today’s just-in-time compilers for Java, Ruby, Python, and JavaScript.
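As a small, self-contained illustration of one of these techniques, here is a sketch of a lookup cache in C. It is an assumption-laden simplification rather than the Smalltalk or Self implementation: a call site remembers the receiver’s class and the method it resolved to, and only repeats the full lookup when the receiver’s class changes.

#include <stdio.h>
#include <string.h>

typedef void (*Method)(void);

typedef struct {
  const char* name;
  const char* selector;   /* a single method per class keeps the sketch small */
  Method      method;
} Class;

typedef struct {
  const Class* cached_class;   /* receiver class last seen at this call site */
  Method       cached_method;  /* method resolved for that class */
} CallSiteCache;

/* Slow path: a real VM would walk the class hierarchy here. */
static Method full_lookup(const Class* cls, const char* selector) {
  printf("full lookup of %s in %s\n", selector, cls->name);
  return strcmp(cls->selector, selector) == 0 ? cls->method : NULL;
}

static void invoke(CallSiteCache* cache, const Class* cls, const char* selector) {
  if (cache->cached_class != cls) {    /* miss: perform the expensive lookup */
    cache->cached_class  = cls;
    cache->cached_method = full_lookup(cls, selector);
  }
  cache->cached_method();             /* hit: call the cached target directly */
}

static void print_point(void) { printf("a Point\n"); }

int main(void) {
  Class point = { "Point", "print", print_point };
  CallSiteCache cache = { NULL, NULL };
  invoke(&cache, &point, "print");   /* first call triggers the full lookup */
  invoke(&cache, &point, "print");   /* second call is served from the cache */
  return 0;
}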

Outline
  • Just-in-time compilation
    • Basic assumptions and application behavior
    • Selection of compilation units
    • Executing Dynamic Languages
    • Using Run-Time Feedback
    • Metacompilation
  • Efficient Data Representation
    • Maps, hidden classes, shapes
    • Storage strategies
    • Handling concurrency and parallelism
Slides

If you have any questions, I am more than happy to answer, possibly on Twitter @smarr or Mastodon.

Squeezing a Little More Performance Out of Bytecode Interpreters

Earlier this year, Wanhong Huang, Tomoharu Ugawa, and I published some new experiments on interpreter performance. We experimented with a genetic algorithm to squeeze a little more performance out of bytecode interpreters. Since I spend much of my research time looking for ways to improve interpreter performance, I was quite intrigued by the basic question behind Wanhong’s experiments: what is the best order of bytecode handlers in the interpreter loop?

The Basics: Bytecode Loops and Modern Processors

Let’s start with a bit of background. Many of today’s widely used interpreters use bytecodes, which represent a program as operations quite similar to processor instructions. Though, depending on the language we are trying to support in our interpreter, the bytecodes can be arbitrarily complex, both in terms of how they encode arguments and in terms of the behavior they implement.

In the simplest case, we would end up with an interpreter loop that looks roughly like this:

uint8_t bytecodes[] = { push_local, push_local, add, /* ... */ };
while (true) {
  uint8_t bytecode = bytecodes[index];
  index += /* ... */;
  switch (bytecode) {
    case push_local:
      // ...
      break;
    case add:
      // ...
      break;
    case call_method:
      // ...
      break;
  }
}
Listing 1: switch/case Interpreter Loop.

Here, push_local and add are much simpler than any call_method bytecode. Depending on the language we try to implement, push_local is likely just a few processor instructions, while call_method might be significantly more complex, because it may need to look up the method, ensure that arguments are passed correctly, and ensure that there is memory for the local variables of the method to be executed. Since bytecodes can be arbitrarily complex, S. Brunthaler distinguished between high abstraction-level interpreters and low abstraction-level ones. High abstraction-level interpreters do not spend a lot of time on the bytecode dispatch, but low abstraction-level ones do, because their bytecodes are comparably simple and often comprise just a handful of processor instructions. Thus, low abstraction-level interpreters benefit most from optimizing the bytecode dispatch.

A classic optimization of the bytecode dispatch is threaded code interpretation, in which we represent a program not only by its bytecodes but also by an additional array of jump addresses. This optimization is often called direct threaded code. It is particularly beneficial for low abstraction-level interpreters but is applied more widely.

uint8_t bytecodes[] = { push_local, push_local, add, /* ... */ };
void* targets[]     = { &&push_local, &&push_local, &&add, /* ... */ };

push_local:
  // ...
  void* target = targets[index];
  index += /* ... */;
  goto *target;

add:
  // ...
  void* target = targets[index];
  index += /* ... */;
  goto *target;

call_method:
  // ...
  void* target = targets[index];
  index += /* ... */;
  goto *target;
Listing 2: Direct Threaded Interpreter.

With this interpreter optimization, we no longer have the explicit while loop. Instead, we have a goto label for each bytecode handler, and each handler has a separate copy of the dispatch code, that is, the goto jump instruction.

This helps modern processors in at least two ways:

  1. it avoids an extra jump to the top of the loop at the end of each bytecode,

  2. and perhaps more importantly, we have multiple dispatch points instead of a single one.

This is important for branch prediction. In our first loop, a processor can hardly predict where a jump is going, because the switch normally translates to a single jump that goes to most, if not all, bytecode handlers. Though, when the jump sits at the end of a specific bytecode handler, typically only a subset of bytecodes follows it, which increases the chance that the processor can predict the jump correctly.

Unfortunately, modern processors are rather complex. They have limits, for instance, on how many jump targets they can remember. They may also end up combining the history of different jump instructions, perhaps because they use an associative cache based on the address of the jump instruction. They have various caches, including the instruction cache, into which our interpreter loop ideally fits for best performance. And all these things may interact in unexpected ways.

For me, these things make it pretty hard to predict the performance of a bytecode loop for a complex language such as JavaScript or Ruby.

And if it’s too hard to understand, why not try some kind of machine learning? That’s indeed what Wanhong and Tomoharu came up with.

Towards an Optimal Bytecode Handler Ordering

With all the complexity of modern processors, Wanhong and Tomoharu observed that changing the order of bytecode handlers can make a significant difference for the overall performance of an interpreter. Of course, this will only make a difference if our interpreter indeed spends a significant amount of time in the bytecode loop and the handler dispatch itself.

When looking at various interpreters, we find that most of them use a natural order. By that I mean that the bytecode handlers appear in the order of the numbers assigned to the bytecodes. Other possible orders could be a random order, or perhaps an order based on the frequency of bytecodes or of bytecode sequences. Thus, one might simply put the most frequently used bytecodes first and the less frequently used ones after them, perhaps hoping that the most-used instructions then fit into caches or help the branch predictor in some way.

The goal of our experiments was to find out whether we can use a genetic algorithm to find a better ordering and thereby improve interpreter performance. The genetic algorithm creates a new order of bytecode handlers by producing a crossover of two existing orderings and then additionally reordering a few handlers, which adds mutation to the new order. The resulting bytecode handler order is then compiled into a new interpreter, for which we measure the run time of a benchmark. With a genetic algorithm, one can thus generate variations of handler orders that, over multiple generations of crossover and mutation, may evolve into a faster handler order.
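To make this a bit more concrete, the following C sketch shows one possible pair of operators, under my own simplifying assumptions rather than the exact operators and parameters used in the paper: an order crossover that keeps a slice of one parent ordering and fills the remaining handler ids in the order of the other parent, and a swap mutation. Each resulting order would then be used to emit the interpreter’s handlers in that order, compile it, and time a benchmark.

#include <stdbool.h>
#include <stdlib.h>

#define NUM_HANDLERS 64   /* hypothetical number of bytecode handlers */

static void crossover(const int* a, const int* b, int* child) {
  int start = rand() % NUM_HANDLERS;
  int end   = start + rand() % (NUM_HANDLERS - start);

  bool used[NUM_HANDLERS] = { false };
  for (int i = start; i <= end; i++) {      /* keep a slice of parent a */
    child[i] = a[i];
    used[a[i]] = true;
  }

  int pos = 0;
  for (int i = 0; i < NUM_HANDLERS; i++) {  /* fill the rest in b's order */
    if (used[b[i]]) { continue; }
    while (pos >= start && pos <= end) { pos++; }
    child[pos++] = b[i];
  }
}

static void mutate(int* order, int num_swaps) {
  for (int i = 0; i < num_swaps; i++) {     /* additionally reorder a few handlers */
    int x = rand() % NUM_HANDLERS;
    int y = rand() % NUM_HANDLERS;
    int tmp = order[x];
    order[x] = order[y];
    order[y] = tmp;
  }
}

int main(void) {
  int parent_a[NUM_HANDLERS], parent_b[NUM_HANDLERS], child[NUM_HANDLERS];
  for (int i = 0; i < NUM_HANDLERS; i++) {
    parent_a[i] = i;                        /* the "natural" handler order */
    parent_b[i] = NUM_HANDLERS - 1 - i;     /* a reversed order */
  }
  crossover(parent_a, parent_b, child);     /* combine both parents */
  mutate(child, 2);                         /* then mutate the result */
  return 0;
}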

I’ll skip the details here, but please check out the paper below for the specifics.

Results on a JavaScript Interpreter

So, how well does this approach work? To find out, we applied it to eJSVM, a JavaScript interpreter designed for resource-constrained devices.

In the context of resource-constrained embedded devices, it may make sense to tailor an interpreter to a specific application to get the best performance. Thus, we started by optimizing the interpreter for a specific benchmark on a specific machine. To keep the time needed for the experiments manageable, we used three Intel machines and one Raspberry Pi with an ARM CPU. In many ways, optimizing for a specific benchmark is the best-case scenario, which is only practical if we can deploy a specific application together with the interpreter. Figure 1 shows the results on benchmarks from the Are We Fast Yet benchmark suite. We can see surprisingly large improvements. While the results depend very much on the processor architecture, every single benchmark sees an improvement on all platforms.

Fig. 1: Speedup over the baseline interpreter after optimizing it for the given benchmark on the given processor: (a) Intel Core i9-12900 with GCC 10.3, (b) Intel Core i7-11700 with GCC 9.4, (c) Intel Xeon W-2235 with GCC 9.4, and (d) ARM Cortex-A53 with GCC 10.2.1. The results suggest a strong influence of the processor architecture and benchmark. On the Intel Xeon, we see a speedup of up to 23% on a specific benchmark, and every single benchmark benefits.

Unfortunately, we can’t really know which programs users will run on our interpreter in all scenarios. Thus, we also looked at how interpreter speed improves on other benchmarks when we train the interpreter on a single benchmark. Figure 2 shows the results. In the top left corner, we see the results when training on the Bounce benchmark. While Bounce itself sees a 7.5% speedup, the same interpreter speeds up the List benchmark by more than 12%. Training the interpreter on the Permute benchmark, however, gives much smaller improvements for the other benchmarks.

Fig. 2: Speedup of benchmarks on interpreters trained for a specific benchmark on the Intel Xeon W-2235 with GCC 9.4. The gains here depend strongly on the benchmark. While training with CD or Queens gives an average speedup of more than 10% across all benchmarks, training on Permute only gives about 3%.

In the paper, we look at a few more aspects, including which genetic algorithm works best and how portable performance is between architectures.

Optimizing Bytecode Handler Order for Other Interpreters

Reading this blog post, you may wonder how best to go about experimenting with your own interpreter. We also briefly tried optimizing CRuby; unfortunately, we did not yet manage to find the time to continue, but we found a few things that one needs to watch out for when doing so.

First, you may have noticed that we used relatively old versions of GCC. For eJSVM, these gave good results and did not interfere with our reordering. However, on CRuby and with newer GCCs, the compiler will start to reorder basic blocks itself, which makes it harder to get the desired results. Here, flags such as -fno-reorder-blocks or -fno-reorder-blocks-and-partition may be needed. Clang didn’t seem to reorder basic blocks in the interpreter loop.

As a basic test of how big the performance impact might be, I simply ran a handful of random bytecode handler orders, which I would normally expect to show some performance difference, likely a slowdown. Though, for CRuby I did not see a notable performance change, which suggests that its bytecode dispatch may not be worth optimizing further. But it’s a bit early to tell conclusively at this point. We should give CPython and others a go, but haven’t gotten around to it just yet.

Conclusion

If you care about interpreter performance, maybe it’s worth taking a look at the interpreter loop and checking whether modern processors deliver better performance when the bytecode handlers are reordered.

Our results suggest that reordering can give large improvements when training for a specific benchmark. There is also still a benefit for other benchmarks that we did not train for, though it depends on how similar the training benchmark is to the others.

For more details, please read the paper linked below, or reach out on Twitter @smarr.

Abstract

Interpreter performance remains important today. Interpreters are needed in resource constrained systems, and even in systems with just-in-time compilers, they are crucial during warm up. A common form of interpreters is a bytecode interpreter, where the interpreter executes bytecode instructions one by one. Each bytecode is executed by the corresponding bytecode handler.

In this paper, we show that the order of the bytecode handlers in the interpreter source code affects the execution performance of programs on the interpreter. On the basis of this observation, we propose a genetic algorithm (GA) approach to find an approximately optimal order. In our GA approach, we find an order optimized for a specific benchmark program and a specific CPU.

We evaluated the effectiveness of our approach on various models of CPUs including x86 processors and an ARM processor. The order found using GA improved the execution speed of the program for which the order was optimized between 0.8% and 23.0% with 7.7% on average. We also assess the cross-benchmark and cross-machine performance of the GA-found order. Some orders showed good generalizability across benchmarks, speeding up all benchmark programs. However, the solutions do not generalize across different machines, indicating that they are highly specific to a microarchitecture.

  • Optimizing the Order of Bytecode Handlers in Interpreters using a Genetic Algorithm
    W. Huang, S. Marr, T. Ugawa; In The 38th ACM/SIGAPP Symposium on Applied Computing (SAC '23), SAC'23, p. 10, ACM, 2023.
  • Paper: PDF
  • DOI: 10.1145/3555776.3577712
  • BibTex: bibtex
    @inproceedings{Huang:2023:GA,
      abstract = {Interpreter performance remains important today. Interpreters are needed in
      resource constrained systems, and even in systems with just-in-time compilers,
      they are crucial during warm up. A common form of interpreters is a bytecode
      interpreter, where the interpreter executes bytecode instructions one by one.
      Each bytecode is executed by the corresponding bytecode handler.
      
      In this paper, we show that the order of the bytecode handlers in the
      interpreter source code affects the execution performance of programs on the
      interpreter. On the basis of this observation, we propose a genetic algorithm
      (GA) approach to find an approximately optimal order. In our GA approach, we
      find an order optimized for a specific benchmark program and a specific CPU.
      
      We evaluated the effectiveness of our approach on various models of CPUs
      including x86 processors and an ARM processor. The order found using GA
      improved the execution speed of the program for which the order was optimized
      between 0.8% and 23.0% with 7.7% on average. We also assess the cross-benchmark
      and cross-machine performance of the GA-found order. Some orders showed good
      generalizability across benchmarks, speeding up all benchmark programs.
      However, the solutions do not generalize across different machines, indicating
      that they are highly specific to a microarchitecture.},
      author = {Huang, Wanhong and Marr, Stefan and Ugawa, Tomoharu},
      blog = {https://stefan-marr.de/2023/06/squeezing-a-little-more-performance-out-of-bytecode-interpreters/},
      booktitle = {The 38th ACM/SIGAPP Symposium on Applied Computing (SAC '23)},
      doi = {10.1145/3555776.3577712},
      isbn = {978-1-4503-9517-5/23/03},
      keywords = {Bytecodes CodeLayout EmbeddedSystems GeneticAlgorithm Interpreter JavaScript MeMyPublication Optimization myown},
      month = mar,
      pages = {10},
      pdf = {https://stefan-marr.de/downloads/acmsac23-huang-et-al-optimizing-the-order-of-bytecode-handlers-in-interpreters-using-a-genetic-algorithm.pdf},
      publisher = {ACM},
      series = {SAC'23},
      title = {{Optimizing the Order of Bytecode Handlers in Interpreters using a Genetic Algorithm}},
      year = {2023},
      month_numeric = {3}
    }
    
