Instrumentation-based Profiling on JVMs is Broken!
Last year, we looked at how well sampling profilers work on top of the JVM. Unfortunately, they suffer from issues such as safepoint bias and may not attribute observed run time to the correct methods because of the complexities introduced by inlining and other compiler optimizations.
After looking at sampling profilers, Humphrey started to investigate instrumentation-based profilers and found during his initial investigation that they were giving much more consistent numbers. Unfortunately, it quickly became clear that the state-of-the-art instrumentation-based profilers on the JVM also have major issues, which result in profiles that are not representative of production performance. Since profilers are supposed to help us identify performance issues, they fail at their one job.
When investigating them further, we found that they interact badly with inlining and other standard optimizations. Because the profilers we found instrument JVM bytecodes, they add a lot of extra code, which compiler optimizations treat like any other application code. While this does not strictly prevent optimizations such as inlining, the extra code interferes enough that the observable behavior of a program is basically identical with and without inlining. In practice, this means that instrumentation-based profilers on the JVM may be easily portable, but they can’t effectively guide developers to the code that would benefit most from attention, which is their main purpose.
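To illustrate the problem, here is a hypothetical sketch of what bytecode-level instrumentation effectively turns a method into. The `Profiler` class and its `enter()` method are made up for this sketch; real profilers inject comparable probe calls, and it is this injected code that the optimizer then has to treat as part of the application:

```java
// Hypothetical illustration only: what an instrumented method roughly
// looks like after bytecode rewriting. The Profiler class below is an
// assumption for this sketch, not a real profiler's API.
final class Profiler {
    static final java.util.Map<String, Long> counts = new java.util.HashMap<>();

    // Injected at every method entry; this call and the map update are
    // the "extra code" that interferes with inlining and other optimizations.
    static void enter(String method) {
        counts.merge(method, 1L, Long::sum);
    }
}

final class Instrumented {
    // The original method body was just `return x + 1;` -- trivially
    // inlinable. With the injected probe, the compiler now sees a much
    // bigger method with side effects.
    static int incr(int x) {
        Profiler.enter("Instrumented.incr");  // injected probe
        return x + 1;
    }
}
```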
Profilers that do not capture production performance will misguide us!
While they can still identify the code that is executed most often, the interaction with optimizations means that developers mostly see unoptimized behavior. With today’s highly optimizing compilers this is unfortunate: we may end up hand-optimizing code that the compiler would normally have optimized for us already, and spend time on things that likely won’t make a difference in production.
Let’s look at an example from our paper:
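The paper’s listing is not reproduced here, but a reconstructed sketch of the kind of code being described might look as follows. The names (`Action`, `LogAction`, `AuditAction`, `Framework`) are assumptions for this sketch; only `execute()` and `bubbleSortById()` come from the discussion below:

```java
// Reconstructed sketch, not the paper's exact code: a framework invokes
// actions whose execute() methods do nothing, followed by a sort that
// does the real work.
interface Action {
    void execute();
}

final class Framework {
    // With only two Action implementations loaded, the call site below is
    // bimorphic, so a JIT compiler like HotSpot's C2 can inline both
    // targets and potentially remove the then-empty loop entirely.
    static void runAll(Action[] actions) {
        for (Action a : actions) {
            a.execute();
        }
    }
}

final class LogAction implements Action {
    public void execute() { /* intentionally empty */ }
}

final class AuditAction implements Action {
    public void execute() { /* intentionally empty */ }
}

final class App {
    // The code a good profiler should point at: an O(n^2) sort that
    // dominates run time once the empty actions are optimized away.
    static void bubbleSortById(int[] ids) {
        for (int i = 0; i < ids.length - 1; i++) {
            for (int j = 0; j < ids.length - 1 - i; j++) {
                if (ids[j] > ids[j + 1]) {
                    int tmp = ids[j];
                    ids[j] = ids[j + 1];
                    ids[j + 1] = tmp;
                }
            }
        }
    }
}
```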
In this arguably somewhat contrived example, we use some kind of framework, for which we provide actions that the framework applies for us. This is probably a worst case for profilers that instrument bytecodes. Here, the execute() methods would be identified as the most problematic aspect, even though they don’t do anything. A just-in-time compiler like HotSpot’s C2 would likely end up seeing a bimorphic call site to execute() and inline both methods. And if the compiler heuristics are with us, it might even optimize out the empty loop in the framework.

So, if we assume a sufficiently smart compiler, our inefficient code, forced on us by the framework, is taken care of by the compiler. A good profiler would ideally guide us to bubbleSortById() as being of interest. Typically, we’d expect a good speedup here from switching to a more suitable sort, especially since we implicitly assume that there are many actions, so this code matters in production.
To me, this means instrumentation-based profilers can only be a last resort for when sampling, with its own flaws, fails. They are just not useful enough as they are.
Can we do better than profilers that instrument bytecode?
At the time, Humphrey was quite in favor of instrumentation, because it gives very consistent results. So, he wanted to make the results of instrumentation-based profilers more realistic. Inspired by the work of Basso et al., he built an instrumentation-based profiler into the Graal just-in-time compiler that works more like classic instrumentation-based profilers for ahead-of-time-compiled language implementations.
The basic idea is illustrated below:
Instead of inserting the instrumentation right when the bytecode is loaded, for instance with an agent or some other form of bytecode rewriting, we move the addition of instrumentation code to a much later point in the just-in-time compilation. Most importantly, we insert it only after inlining and most other optimizations have been performed. To keep the prototype simple, we insert the probes right before the high-level IR is lowered to the low-level IR. At this point, a few optimizations are still to come, including instruction selection and register allocation. Though, in the grand scheme of things, these are minor.
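The idea can be sketched with a toy compilation pipeline. This is not the actual Graal API; `CompilationUnit`, `Phase`, and the string-based IR are simplifications assumed for this sketch, but the ordering matches the approach described above, with probes added only after inlining and the main optimization phases:

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of late-phase instrumentation; not the real Graal API.
final class CompilationUnit {
    final String rootMethod;
    final List<String> ops = new ArrayList<>();  // simplified high-level IR
    CompilationUnit(String rootMethod) { this.rootMethod = rootMethod; }
}

interface Phase {
    void run(CompilationUnit unit);
}

// By the time this phase runs, inlining has already merged callees into
// the unit, so the probe attributes time to the compiled unit as a whole
// and the earlier optimization phases never saw any instrumentation code.
final class InsertProbesPhase implements Phase {
    public void run(CompilationUnit unit) {
        unit.ops.add(0, "probe[" + unit.rootMethod + "]");
    }
}

final class Pipeline {
    static CompilationUnit compile(String method, List<String> body) {
        CompilationUnit unit = new CompilationUnit(method);
        unit.ops.addAll(body);
        // ... inlining and optimization phases would run here, on
        // instrumentation-free code ...
        new InsertProbesPhase().run(unit);  // probes inserted late
        // ... then lowering, instruction selection, register allocation ...
        return unit;
    }
}
```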
How much better is it?
With his prototype, Humphrey managed not only to achieve much better performance than classic instrumentation-based profilers, but also to minimize interference with optimizations. For a rough idea of the overall performance impact of this approach, let’s have a look at Figure 2:
With a few extra tricks, briefly sketched in the paper, we get good attribution of where time is spent, even in the presence of inlining; we reduce overhead; and we benefit from the more precise results of instrumentation, which does not have sampling’s drawback of only occasionally obtaining data.
There’s one major open question though: what does a correct profile look like? At the moment, we can’t assess whether our approach is correct. Sampling profilers, as we saw last year, also do not agree on a single answer. So, while we believe our approach is much better than classic instrumentation, we still need to find out how correct it is.
All results so far, and a few more technical details, are in the paper linked below. Questions, pointers, and suggestions are greatly appreciated, perhaps on Mastodon or Twitter @smarr.
Abstract
Profilers are crucial tools for identifying and improving application performance. However, for language implementations with just-in-time (JIT) compilation, e.g., for Java and JavaScript, instrumentation-based profilers can have significant overheads and report unrealistic results caused by the instrumentation.
In this paper, we examine state-of-the-art instrumentation-based profilers for Java to determine the realism of their results. We assess their overhead, the effect on compilation time, and the generated bytecode. We found that the profiler with the lowest overhead increased run time by 82x. Additionally, we investigate the realism of results by testing a profiler’s ability to detect whether inlining is enabled, which is an important compiler optimization. Our results document that instrumentation can alter program behavior so that performance observations are unrealistic, i.e., they do not reflect the performance of the uninstrumented program.
As a solution, we sketch late-compiler-phase-based instrumentation for just-in-time compilers, which gives us the precision of instrumentation-based profiling with an overhead that is multiple magnitudes lower than that of standard instrumentation-based profilers, with a median overhead of 23.3% (min. 1.4%, max. 464%). By inserting probes late in the compilation process, we avoid interfering with compiler optimizations, which yields more realistic results.
- Towards Realistic Results for Instrumentation-Based Profilers for JIT-Compiled Systems
H. Burchell, O. Larose, S. Marr; In Proceedings of the 21st ACM SIGPLAN International Conference on Managed Programming Languages and Runtimes, MPLR'24, ACM, 2024. - Paper: PDF
- DOI: 10.1145/3679007.3685058
BibTeX:
```bibtex
@inproceedings{Burchell:2024:InstBased,
  abstract = {Profilers are crucial tools for identifying and improving application performance. However, for language implementations with just-in-time (JIT) compilation, e.g., for Java and JavaScript, instrumentation-based profilers can have significant overheads and report unrealistic results caused by the instrumentation. In this paper, we examine state-of-the-art instrumentation-based profilers for Java to determine the realism of their results. We assess their overhead, the effect on compilation time, and the generated bytecode. We found that the profiler with the lowest overhead increased run time by 82x. Additionally, we investigate the realism of results by testing a profiler’s ability to detect whether inlining is enabled, which is an important compiler optimization. Our results document that instrumentation can alter program behavior so that performance observations are unrealistic, i.e., they do not reflect the performance of the uninstrumented program. As a solution, we sketch late-compiler-phase-based instrumentation for just-in-time compilers, which gives us the precision of instrumentation-based profiling with an overhead that is multiple magnitudes lower than that of standard instrumentation-based profilers, with a median overhead of 23.3% (min. 1.4%, max. 464%). By inserting probes late in the compilation process, we avoid interfering with compiler optimizations, which yields more realistic results.},
  author = {Burchell, Humphrey and Larose, Octave and Marr, Stefan},
  blog = {https://stefan-marr.de/2024/09/instrumenation-based-profiling-on-jvms-is-broken/},
  booktitle = {Proceedings of the 21st ACM SIGPLAN International Conference on Managed Programming Languages and Runtimes},
  doi = {10.1145/3679007.3685058},
  keywords = {Graal Instrumentation JVM Java MeMyPublication Optimization Profiler Profiling Sampling myown},
  month = sep,
  pdf = {https://stefan-marr.de/downloads/mplr24-burchell-et-al-towards-realistic-results-for-instrumentation-based-profilers-for-jit-compiled-systems.pdf},
  publisher = {ACM},
  series = {MPLR'24},
  title = {{Towards Realistic Results for Instrumentation-Based Profilers for JIT-Compiled Systems}},
  year = {2024},
  month_numeric = {9}
}
```