The Changing “Guarantees” Given by Python's Global Interpreter Lock

In this blog post, I will look into the implementation details of CPython’s Global Interpreter Lock (GIL) and how they changed between Python 3.9 and the current development branch that will become Python 3.13.

My goal is to understand which concrete “guarantees” the GIL gives in both versions, which “guarantees” it does not give, and which ones one might assume based on testing and observation. I am putting “guarantees” in quotes, because with a future no-GIL Python, none of the discussed properties should be considered language guarantees.

While Python has various implementations, including CPython, PyPy, Jython, IronPython, and GraalPy, I’ll focus on CPython as the most widely used implementation. PyPy and GraalPy also use a GIL, but their implementations differ subtly from CPython’s, as we will see a little later.

1. What Is the GIL?

Let’s recap a bit of background. When CPython started to support multiple operating system threads, it became necessary to protect various CPython-internal data structures from concurrent access. Instead of adding locks or using atomic operations to protect the correctness of, for instance, reference counting, the contents of lists and dictionaries, or other internal data structures, the CPython developers decided to take a simpler approach and use a single global lock, the GIL, to protect all of these data structures from incorrect concurrent accesses. As a result, one can start multiple threads in CPython, though only one of them runs Python bytecode at any given time.

The main benefit of this approach is its simplicity and single-threaded performance. Because there’s only a single lock to worry about, it’s easy to get the implementation correct without risking deadlocks or other subtle concurrency bugs at the level of the CPython interpreter. Thus, the GIL represented a suitable point in the engineering trade-off space between correctness and performance.

2. Why Does the Python Community Think About Removing the GIL?

Of course, the obvious downside of this design is that only a single thread can execute Python bytecode at any given time. I am talking about Python bytecode here again, because operations that may take a long time, for instance reading a file into memory, can release the GIL and allow other threads to run in parallel.

For programs that spend most of their time executing Python code, the GIL is of course a huge performance bottleneck, and thus, PEP 703 proposes to make the GIL optional. The PEP mentions various use cases, including machine learning, data science, and other numerical applications.

3. Which “Guarantees” Does the GIL Provide?

So far, I only mentioned that the GIL is there to protect CPython’s internal data structures from concurrent accesses to ensure correctness. However, when writing Python code, I am more interested in the “correctness guarantees” the GIL gives me for the concurrent code that I write. To know these “correctness guarantees”, we need to delve into the implementation details of when the GIL is acquired and released.

The general approach is that a Python thread obtains the GIL when it starts executing Python bytecode. It will hold the GIL as long as it needs to and eventually release it, for instance when it is done executing, or when it executes an operation that is often long-running and does not itself require the GIL for correctness. This includes, for instance, the aforementioned file reading or, more generally, any I/O operation. However, a thread may also release the GIL when executing specific bytecodes.

This is where Python 3.9 and 3.13 differ substantially. Let’s start with Python 3.13, which I think roughly corresponds to what Python has been doing since version 3.10 (roughly since this PR). Here, the most relevant bytecodes are for function or method calls as well as bytecodes that jump back to the top of a loop or function. Thus, only a few bytecodes check whether there was a request to release the GIL.

In contrast, in Python 3.9 and earlier versions, the GIL is released, at least in some situations, by almost all bytecodes. Only a small set of bytecodes, including stack operations, LOAD_FAST, LOAD_CONST, STORE_FAST, UNARY_POSITIVE, IS_OP, CONTAINS_OP, and JUMP_FORWARD, does not check whether the GIL should be released.

The bytecodes that do check all use CHECK_EVAL_BREAKER() on 3.13 (src) or DISPATCH() on 3.9 (src), which eventually checks (3.13, 3.9) whether another thread requested the GIL to be released by setting the GIL_DROP_REQUEST bit in the interpreter’s state.

What makes “atomicity guarantees” more complicated to reason about is that this bit is set by threads waiting for the GIL based on a timeout (src). The timeout is specified by sys.setswitchinterval().
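For reference, this switch interval can be inspected and changed at run time; in current CPython versions the default is 5 ms. A tiny illustration:

import sys

print(sys.getswitchinterval())   # 0.005 seconds by default
sys.setswitchinterval(0.001)     # request GIL hand-offs every 1 ms instead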

In practice, what does this mean?

For Python 3.13, this means that a function containing only bytecodes that do not lead to a CHECK_EVAL_BREAKER() check should be atomic.

For Python 3.9, this means that only a very small set of bytecode sequences can be atomic; except for a tiny set of specific cases, one should assume that a bytecode sequence is not atomic.
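To reason about a concrete function, one can look at its bytecode with the dis module. The function below is merely an illustration of mine; the exact instruction sequence, and thus where the eval breaker may be checked, differs between CPython versions:

import dis

def get_next_id(obj):
    request_id = obj.next_id
    obj.next_id += 1
    return request_id

# Print the bytecode to see which instructions the function consists of.
dis.dis(get_next_id)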

However, since the Python community is taking steps that may lead to the removal of the GIL, the changes in recent Python versions that give much stronger atomicity “guarantees” are likely a step in the wrong direction for the correctness of concurrent Python code. I mean this in the sense that people may accidentally come to rely on these implementation details, leading to hard-to-find concurrency bugs when running on a no-GIL Python.

4. Which Guarantees Might One Incorrectly Assume the GIL Provides?

Thanks to @cfbolz, I have at least one very concrete example of code that someone assumed to be atomic:

request_id = self._next_id
self._next_id += 1

The code tries to solve a classic problem: we want to hand out unique request ids. However, it breaks when multiple threads execute this code at the same time, or rather interleaved with each other, because then we end up handing out the same id multiple times. This concrete bug was fixed by making the reading and incrementing atomic using a lock.
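A minimal sketch of such a fix, using a threading.Lock; the class and attribute names are mine and not taken from the original code:

import threading

class IdGenerator:
    def __init__(self):
        self._lock = threading.Lock()
        self._next_id = 0

    def get_next_id(self):
        # The lock makes the read-and-increment atomic,
        # independent of any GIL scheduling details.
        with self._lock:
            request_id = self._next_id
            self._next_id += 1
        return request_id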

On Python 3.9, we can relatively easily demonstrate the issue:

def get_id(self):
    # expected to be atomic: start
    request_id = self.id
    self.id += 1
    # expected to be atomic: end

    self.usage[request_id % 1_000] += 1

Running this on multiple threads will allow us to observe an inconsistent number of usage counts. They should all be the same, but they are not. Arguably, it’s not clear whether the observed atomicity issue comes from the request_id or the usage counts, but the underlying issue is the same in both cases. For the full example see 999_example_bug.py.
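For illustration, a rough, self-contained sketch of such an experiment is shown below; the class and driver are mine and not the actual 999_example_bug.py, and on newer CPython versions the counts may well stay consistent:

import threading

class Worker:
    def __init__(self):
        self.id = 0
        self.usage = [0] * 1_000

    def get_id(self):
        request_id = self.id
        self.id += 1
        self.usage[request_id % 1_000] += 1

def run(worker, n):
    for _ in range(n):
        worker.get_id()

worker = Worker()
threads = [threading.Thread(target=run, args=(worker, 1_000_000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# With an atomic read-and-increment, the ids 0..3_999_999 would each be handed out
# once and every usage bucket would end up at exactly 4_000.
print(min(worker.usage), max(worker.usage))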

This repository contains a number of other examples that demonstrate the difference between different Python implementations and versions.

Generally, on Python 3.13 most bytecode sequences without function calls will be atomic. On Python 3.9, far fewer are, which I believe is preferable, because it keeps people from writing code that relies on the very strong guarantees that Python 3.13 gives.

As mentioned earlier, because the GIL is released based on a timeout, one may also perceive bytecode sequences as atomic when experimenting.

Let’s assume we run the following two functions on threads in parallel:

def insert_fn(list):
    for i in range(100_000_000):
        list.append(1)
        list.append(1)
        list.pop()
        list.pop()
    return True


def size_fn(list):
    for i in range(100_000_000):
        l = len(list)
        assert l % 2 == 0, f"List length was {l} at attempt {i}"
    return True

Depending on how fast the machine is, it may take 10,000 or more iterations of the loop in size_fn before we see an odd list length. In other words, it takes 10,000 iterations before the function calls to append or pop allow the GIL to be released before the second append(1) or after the first pop().

Without looking at the CPython source code, one might have concluded easily that these bytecode sequences are atomic.

Though, there’s a way to make it visible earlier. By setting the thread switch interval to a very small value, for instance with sys.setswitchinterval(0.000000000001), one can observe an odd list length after only a few or a few hundred iterations of the loop.
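For completeness, a minimal driver for the two functions above might look as follows; the thread setup is my own sketch and not taken from the repository:

import sys
import threading

# Make GIL hand-offs much more frequent than the 5 ms default.
sys.setswitchinterval(0.000000000001)

shared = []
t1 = threading.Thread(target=insert_fn, args=(shared,))
t2 = threading.Thread(target=size_fn, args=(shared,))
t1.start()
t2.start()
t1.join()
# size_fn's assert fires, in its own thread, once it observes an odd length.
t2.join()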

5. Comparing Observable GIL Behavior for Different CPython Versions, PyPy, and GraalPy

In my gil-sem-demos repository, I have a number of examples that try to demonstrate observable differences in GIL behavior.

Of course, the very first example tries to show the performance benefit of running multiple Python threads in parallel. Using the no-GIL implementation, one indeed sees the expected parallel speedup.

On the other tests, we see the major differences between Python 3.8 - 3.9 and the later 3.10 - 3.13 versions. The latter versions usually execute the examples without producing results that would indicate bytecode-level atomicity granularity. Instead, the results suggest that loop bodies without function calls are pretty much atomic.

For PyPy and GraalPy, it is also harder to observe the bytecode-level atomicity granularity, because they are simply faster. Lowering the switch interval makes it a little more observable, except for GraalPy, which likely aggressively removes the checks for whether to release the GIL.

Another detail for the no-GIL implementation: it crashes for our earlier bug example. It complains about *** stack smashing detected ***.

A full log is available as a gist.

6. Conclusion

In this blog post, I looked into the implementation details of CPython’s Global Interpreter Lock (GIL). The semantics between Python 3.9 and 3.13 differ substantially. Python 3.13 gives much stronger atomicity “guarantees”, releasing the GIL basically only on function calls and jumps back to the top of a loop or function.

If the Python community intends to remove the GIL, this seems problematic. I would expect more people to implicitly rely on these much stronger guarantees, whether consciously or not.

My guess would be that this change was done mostly in an effort to improve the single-threaded performance of CPython.

To enable people to test their code under semantics closer to those of a no-GIL implementation, I would suggest adding a compile-time option to CPython that forces a GIL release and thread switch after every bytecode that may trigger behavior visible to other threads. This way, people would have a chance to test on a stable system that is closer to the future no-GIL semantics and probably only minimally slower at executing unit tests.

Any comments, suggestions, or questions are also greatly appreciated perhaps on Twitter @smarr or Mastodon.

Which Interpreters are Faster, AST or Bytecode?

This post is a brief overview of our new study of abstract-syntax-tree and bytecode interpreters on top of RPython and the GraalVM metacompilation systems, which we are presenting next week at OOPSLA.

Prelude: Why Did I Build a Language on Top of the Truffle and RPython Metacompilation Systems?

Writing this post, I realized that I have been working on interpreters on top of metacompilation systems for 10 years now. And in much of that time, it felt to me that widely held beliefs about interpreters did not quite match my own experience.

In summer 2013, I started to explore new directions for my research, having finished my PhD just in January of that year. The one question that was still bugging me back then was how I could show that my PhD idea of using metaobject protocols to realize different concurrency models was not just possible, but perhaps even practical.

During my PhD, I worked with a bytecode interpreter written in C++ and now had the choice of spending the next few years writing a just-in-time compiler for it, which would never be able to keep up with what state-of-the-art VMs offered, or finding another way to get good performance.

To my surprise, there was another way. Though even today, some say that metacompilation is ludicrous and will never be practical. And others simply use PyPy as their secret sauce and enjoy better Python performance…

Though, I didn’t start with PyPy, or rather its RPython metacompilation framework. Instead, I found the Truffle framework with its Graal just-in-time compiler and implemented the Simple Object Machine (SOM) on top of it, a dynamic language I had been using for a while already. Back then, Truffle and Graal were very new, and it took a while to get my TruffleSOM to reach the desired “state-of-the-art performance”, but I got there eventually. On the way to reaching that point, I also implemented PySOM on top of PyPy’s RPython framework. This was my way to get a better understanding of metacompilation more generally. But enough of history…

The Promise of Metacompilation: State-of-the-Art Performance for Little Effort

If you’re Google, you can afford to finance your own JavaScript virtual machine with a state-of-the-art interpreter, state-of-the-art garbage collector, and no less than three different just-in-time compilers, which of course are also “state of the art”. Your average academic, and even large language communities such as those around Ruby, Python, and PHP do not have the resources to build such VMs. (Of course, reality is more complicated, but I’ll skip over it.)

This is where metacompilation comes in and promises us that we can reuse existing compilers, garbage collectors, and all the other parts of an existing high-level language VM. All we need to do is implement our language as an interpreter on top of something like the GraalVM or RPython systems.

That’s the promise. And, for a certain set of benchmarks, and a certain set of use cases, these systems deliver exactly that: reuse of these components. We still have to implement an interpreter suitable for these systems though. While this is no small feat, it’s something “an average” academic can do with enough time and stubbornness, and there are plenty of examples using Truffle and RPython.

And my SOM implementation indeed manages to hold its own compared to Google’s V8:

Figure 1. Just-in-time-compiled peak performance of the Are We Fast Yet benchmarks, shown as an aggregate over all benchmarks on a logarithmic scale, with Java 17 (HotSpot) as the baseline. TSOMAST reaches the performance of Node.js, while the peak performance of the other implementations is a little further behind.

As we can see here, both V8 inside of Node.js and TSOMAST, which is short for the abstract-syntax-tree-based TruffleSOM interpreter, are roughly in the same range of being 1.7× to 2.7× slower than the HotSpot JVM. With SOM being a dynamic language similar to JavaScript, and its performance easily within ±50% of V8’s, I’d argue that metacompilation indeed lives up to its promise.

Of course, the Are We Fast Yet benchmarks used here test only a relatively small common part of these languages, but they show that the core language elements common to Java, JavaScript, and other object-oriented languages reach roughly the same level of performance.

Interpreters for Metacompilation Systems

However, we wanted to talk about abstract-syntax-tree (AST) and bytecode interpreters. The goal of our work was to investigate the difference between these two approaches to building interpreters on top of GraalVM and RPython.

To this end, we had to implement no less than four interpreters: two AST interpreters and two bytecode interpreters, one each on the two different systems. We also had to optimize them roughly to the same level, so that we could draw conclusions from the comparison. While AST and bytecode interpreters naturally lend themselves to different optimizations, we didn’t stop at what’s commonly done. Instead, we implemented the classic optimizations to gain the key performance benefits, and once we hit diminishing returns, we added the optimizations from the AST interpreters to the bytecode ones or the other way around, so that they roughly implement the same set of optimizations. Currently, this includes the following (see Section 4 of the paper for more details; a small illustrative sketch of the first item follows the list):

  • polymorphic lookup/inline caching
  • inlining of control structures and anonymous functions in specific situations
  • superinstructions for bytecode interpreters and supernodes for AST interpreters
  • bytecode quickening and self-optimizing AST nodes
  • lowering/intrinsifying of basic standard library methods
  • caching of globals
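To give a flavor of the first item, here is a deliberately simplified, hypothetical sketch in Python of what an inline cache at an AST call-site node can look like. It shows only a monomorphic cache and glosses over everything the actual TruffleSOM and PySOM implementations have to handle, such as node specialization, polymorphic cache chains, and invalidation:

class Literal:
    """AST node for a constant value."""
    def __init__(self, value):
        self.value = value

    def execute(self, frame):
        return self.value


class CachedMessageSend:
    """AST node for a method call with a (monomorphic) inline cache."""
    def __init__(self, selector, receiver_expr):
        self.selector = selector
        self.receiver_expr = receiver_expr
        self.cached_class = None   # receiver class seen at this call site
        self.cached_method = None  # lookup result for that class

    def execute(self, frame):
        receiver = self.receiver_expr.execute(frame)
        receiver_class = type(receiver)
        if receiver_class is not self.cached_class:
            # Cache miss: perform the (slow) lookup and remember the result.
            self.cached_method = getattr(receiver_class, self.selector)
            self.cached_class = receiver_class
        # Cache hit: reuse the lookup as long as the receiver class stays the same.
        return self.cached_method(receiver)


send = CachedMessageSend("upper", Literal("hello"))
print(send.execute(frame=None))  # 'HELLO'; repeated calls skip the lookup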

Some other classic interpreter optimizations were not directly possible on top of the metacompilation systems. This includes indirect threading and top-of-stack caching for bytecode interpreters. Well, more precisely, we experimented a little with both, but they didn’t give the desired benefits and rather slowed the interpreters down, and we only made them work on a few benchmarks. Pushing this further will require extensive changes to the metacompilation systems as far as we can tell from talking to people working on GraalVM and RPython. So, future work…

With all these optimizations, we reached a point where adding further optimizations showed only minimal gains, typically specific to a benchmark, which gives us some confidence that our results are meaningful.

The final results are as follows:

Figure 2. Interpreter-only run-time performance of the Are We Fast Yet benchmarks, on a logarithmic scale, with Java 17 as baseline. While TSOM (TruffleSOM) and PySOM are overall slower than HotSpot's Java interpreter and Node.js/V8's Ignition interpreter, we can also observe that PySOMAST and TSOMAST are faster than the bytecode versions. TSOMBC is the slowest interpreter overall.

Since we are confident that both types of interpreters are optimized roughly to the same level, we conclude that bytecode interpreters do not have their traditional advantage on top of metacompilation systems when it comes to pure interpreter speed.

Based on our current understanding, we attribute that to some general challenges metacompilation systems face when producing native code for the bytecode interpreter loops. GraalVM, for instance, uses the Graal IR, which only supports structured control flow. This means we cannot directly encode arbitrary jumps between bytecodes, and the compiler also struggles with the size of bytecode loops, leaving some optimization opportunities on the table. The bytecode loop of such an interpreter can amount to multiple tens or hundreds of kilobytes of native code, whereas bytecode interpreters written in C/C++ are typically much more compact.

We also looked at memory use, and indeed, on that metric bytecode interpreters win compared to AST interpreters. However, the difference is not as stark as one might expect, and bytecode interpreters may also require more allocations, depending on how they are structured, for boxing or other run-time data structures.

Though, this blog post is just a high-level overview. For the details, please see the paper below.

Recommendations: AST or Bytecode?

Based on our results, what would I do, if I had to implement another language on top of Truffle or RPython? Well, as always, it really depends… Let’s look at the two ends of the spectrum I can think of:

An established, widely used language: For this, I’d assume that a bytecode has been defined and that it is going to evolve slowly in the future. I’d also assume there are possibly large existing programs with a lot of code, i.e., hundreds of thousands of lines of code. For these types of languages, I’d suggest sticking with the bytecode. Bytecode is fast to load, and lots of code is likely executed only once or not at all. This means the memory benefits of the compact representation likely outweigh other aspects, and we still get decent performance.

A completely new language: Here I will assume the language first needs to find its way, and its design may change frequently. On top of metacompilation systems, we can get really good performance, there’s not a lot of code for our language yet, and our users are unlikely to care too much about loading huge amounts of code. Here, AST interpreters in the Truffle style are likely the more flexible choice. You don’t need to design a bytecode and can instead focus on the language and on reaching acceptable performance first. Later, once there are larger code bases with their own performance challenges, one may still think about designing a bytecode, but I suspect few languages will ever get there.

For languages in between these two ends of the spectrum, one would probably want to weigh the engineering effort, which I’d expect to be lower for AST interpreters, against memory use for large code bases, where bytecode interpreters are better.

AST vs. Bytecode: Interpreters in the Age of Meta-Compilation

Our paper includes more on the background, our interpreter implementations, and our experimental design. It also has a more in-depth analysis on the performance properties we observed in terms of run time and memory use. To guide the work of language implementers, we also looked at the various optimizations, to see which ones are most important to gain interpreter as well as just-in-time compiled performance.

As an artifact, our paper comes with a Docker image that includes all experiments and raw data, which hopefully enables others to reproduce our results and expand on or compare to our work. The Dockerfile itself is also on GitHub, where one can also find the latest versions of TruffleSOM and PySOM. Though, these two don’t yet contain the same versions as used for the paper, and may evolve further.

The paper is the result of a collaboration with Octave, Humphrey, and Sophie.

Any comments or suggestions are also greatly appreciated perhaps on Twitter @smarr or Mastodon.

Abstract

Thanks to partial evaluation and meta-tracing, it became practical to build language implementations that reach state-of-the-art peak performance by implementing only an interpreter. Systems such as RPython and GraalVM provide components such as a garbage collector and just-in-time compiler in a language-agnostic manner, greatly reducing implementation effort. However, meta-compilation-based language implementations still need to improve further to reach the low memory use and fast warmup behavior that custom-built systems provide. A key element in this endeavor is interpreter performance. Folklore tells us that bytecode interpreters are superior to abstract-syntax-tree (AST) interpreters both in terms of memory use and run-time performance.

This work assesses the trade-offs between AST and bytecode interpreters to verify common assumptions and whether they hold in the context of meta-compilation systems. We implemented four interpreters, each an AST and a bytecode one using RPython and GraalVM. We keep the difference between the interpreters as small as feasible to be able to evaluate interpreter performance, peak performance, warmup, memory use, and the impact of individual optimizations.

Our results show that both systems indeed reach performance close to Node.js/V8. Looking at interpreter-only performance, our AST interpreters are on par with, or even slightly faster than their bytecode counterparts. After just-in-time compilation, the results are roughly on par. This means bytecode interpreters do not have their widely assumed performance advantage. However, we can confirm that bytecodes are more compact in memory than ASTs, which becomes relevant for larger applications. However, for smaller applications, we noticed that bytecode interpreters allocate more memory because boxing avoidance is not as applicable, and because the bytecode interpreter structure requires memory, e.g., for a reified stack.

Our results show AST interpreters to be competitive on top of meta-compilation systems. Together with possible engineering benefits, they should thus not be discounted so easily in favor of bytecode interpreters.

  • AST vs. Bytecode: Interpreters in the Age of Meta-Compilation
    O. Larose, S. Kaleba, H. Burchell, S. Marr; Proceedings of the ACM on Programming Languages, OOPSLA'23, p. 1–31, ACM, 2023.
  • Paper: HTML, PDF
  • DOI: 10.1145/3622808
  • Appendix: online appendix
  • BibTex: bibtex
    @article{Larose:2023:AstVsBc,
      abstract = {Thanks to partial evaluation and meta-tracing, it became practical to build language implementations that reach state-of-the-art peak performance by implementing only an interpreter. Systems such as RPython and GraalVM provide components such as a garbage collector and just-in-time compiler in a language-agnostic manner, greatly reducing implementation effort. However, meta-compilation-based language implementations still need to improve further to reach the low memory use and fast warmup behavior that custom-built systems provide. A key element in this endeavor is interpreter performance. Folklore tells us that bytecode interpreters are superior to abstract-syntax-tree (AST) interpreters both in terms of memory use and run-time performance.
      
      This work assesses the trade-offs between AST and bytecode interpreters to verify common assumptions and whether they hold in the context of meta-compilation systems. We implemented four interpreters, each an AST and a bytecode one using RPython and GraalVM. We keep the difference between the interpreters as small as feasible to be able to evaluate interpreter performance, peak performance, warmup, memory use, and the impact of individual optimizations.
      
      Our results show that both systems indeed reach performance close to Node.js/V8. Looking at interpreter-only performance, our AST interpreters are on par with, or even slightly faster than their bytecode counterparts. After just-in-time compilation, the results are roughly on par. This means bytecode interpreters do not have their widely assumed performance advantage. However, we can confirm that bytecodes are more compact in memory than ASTs, which becomes relevant for larger applications. However, for smaller applications, we noticed that bytecode interpreters allocate more memory because boxing avoidance is not as applicable, and because the bytecode interpreter structure requires memory, e.g., for a reified stack.
      
      Our results show AST interpreters to be competitive on top of meta-compilation systems. Together with possible engineering benefits, they should thus not be discounted so easily in favor of bytecode interpreters.},
      appendix = {https://doi.org/10.5281/zenodo.8147414},
      author = {Larose, Octave and Kaleba, Sophie and Burchell, Humphrey and Marr, Stefan},
      blog = {https://stefan-marr.de/2023/10/ast-vs-bytecode-interpreters/},
      doi = {10.1145/3622808},
      html = {https://stefan-marr.de/papers/oopsla-larose-et-al-ast-vs-bytecode-interpreters-in-the-age-of-meta-compilation/},
      issn = {2475-1421},
      journal = {Proceedings of the ACM on Programming Languages},
      keywords = {AST Bytecode CaseStudy Comparison Interpreter JITCompilation MeMyPublication MetaTracing PartialEvaluation myown},
      month = oct,
      number = {OOPSLA2},
      pages = {1--31},
      pdf = {https://stefan-marr.de/downloads/oopsla23-larose-et-al-ast-vs-bytecode-interpreters-in-the-age-of-meta-compilation.pdf},
      publisher = {{ACM}},
      series = {OOPSLA'23},
      title = {AST vs. Bytecode: Interpreters in the Age of Meta-Compilation},
      volume = {7},
      year = {2023},
      month_numeric = {10}
    }
    

Don't Blindly Trust Your Java Profiler!

How do we know on what to focus our attention when trying to optimize the performance of a program? I suspect at least some of us will reach for sampling profilers. They keep the direct impact on the program execution low and collect stack traces every so often during the program execution. This gives us an approximate view of where a program spends its time. Though, this approximation, as it turns out, can be surprisingly unreliable.

Humphrey started his research wanting to make profilers produce more directly actionable suggestions of where and how to optimize programs. Though, relatively quickly we noticed that sampling profilers are not only probabilistic, as one would expect, but can give widely different results between runs, results that do not necessarily converge with more runs either.

In 2010, Mytkowicz et al. identified safepoint bias as a key issue for sampling profilers for Java programs. Though, their results were not quite as bad as what we were seeing, so Humphrey started to design experiments to characterize the precision and accuracy of Java profilers in more detail.

How bad does it get?

Just before getting started: we’re fully aware that this isn’t a new issue, and there are quite a number of great and fairly technical blog posts out there discussing a large range of issues, for instance here, here, here, and here. In our work, we will only look at fully deterministic and small pure-Java benchmarks to get a better understanding of what the current situation is.

What’s the issue, you may ask? Well, let’s look at an example. Figure 1 shows the profiling results of Java Flight Recorder over 30 runs on the DeltaBlue benchmark. We see 8 different methods being identified as the hottest method, indicated by the hatched red bars.

Figure 1: Bar chart for the top 15 methods in the DeltaBlue benchmark identified by Java Flight Recorder. A bar represents the average percentage of run time over 30 runs, and the error bars indicate the minimum and maximum values.

Of course, much of this could probably be explained by the non-determinism inherent to JVMs such as HotSpot: just-in-time compilation, parallel compilation, garbage collection, etc. However, we ran each benchmark not only 30 times but also long enough to be fully compiled. So, we basically gave the profiler and JVM a best-case scenario (at least to the degree that is practical; benchmarking is hard, and there are many things going on in modern VMs, see also Tratt’s posts on the topic: 1, 2). And what do we get as a result? No clear indication of where to start optimizing our application. However, if we had looked at only a single profile, we might have started optimizing something that is rarely the bit of code the application spends most of its time on.

Fortunately, this is indeed the worst case we found.

Overall, we looked at async-profiler, Honest Profiler, Java Flight Recorder, JProfiler, perf, and YourKit. Figure 2 shows box plots for each of these profilers to indicate the range of differences we found between the minimum and maximum run-time percentage reported for each method over all benchmarks. The median isn’t too bad, but each profiler shows cases where there is more than a 15% difference between some of the runs.

Figure 2: An aggregate of the differences between the minimum and maximum run-time percentage per method over all benchmarks.

The paper goes into much more detail analyzing the results by comparing profilers with themselves and among each other to be able to characterize accuracy and precision without knowing a ground truth. It also includes plots that show how the results are distributed for specific methods to identify possible sources of the observed variation.

So, for all the details, please see the paper linked below. Any pointers and suggestions are also greatly appreciated perhaps on Twitter @smarr or Mastodon.

Abstract

To identify optimisation opportunities, Java developers often use sampling profilers that attribute a percentage of run time to the methods of a program. Even so these profilers use sampling, are probabilistic in nature, and may suffer for instance from safepoint bias, they are normally considered to be relatively reliable. However, unreliable or inaccurate profiles may misdirect developers in their quest to resolve performance issues by not correctly identifying the program parts that would benefit most from optimisations.

With the wider adoption of profilers such as async-profiler and Honest Profiler, which are designed to avoid the safepoint bias, we wanted to investigate how precise and accurate Java sampling profilers are today. We investigate the precision, reliability, accuracy, and overhead of async-profiler, Honest Profiler, Java Flight Recorder, JProfiler, perf, and YourKit, which are all actively maintained. We assess them on the fully deterministic Are We Fast Yet benchmarks to have a stable foundation for the probabilistic profilers.

We find that profilers are relatively reliable over 30 runs and normally report the same hottest method. Unfortunately, this is not true for all benchmarks, which suggests their reliability may be application-specific. Different profilers also report different methods as hottest and cannot reliably agree on the set of top 5 hottest methods. On the positive side, the average run time overhead is in the range of 1% to 5.4% for the different profilers.

Future work should investigate how results can become more reliable, perhaps by reducing the observer effect of profilers by using optimisation decisions of unprofiled runs or by developing a principled approach of combining multiple profiles that explore different dynamic optimisations.

  • Don’t Trust Your Profiler: An Empirical Study on the Precision and Accuracy of Java Profilers
    H. Burchell, O. Larose, S. Kaleba, S. Marr; In Proceedings of the 20th ACM SIGPLAN International Conference on Managed Programming Languages and Runtimes, MPLR'23, p. 1–14, ACM, 2023.
  • Paper: PDF
  • DOI: 10.1145/3617651.3622985
  • Appendix: online appendix
  • BibTex: bibtex
    @inproceedings{Burchell:2023:Profilers,
      abstract = {To identify optimisation opportunities, Java developers often use sampling profilers that attribute a percentage of run time to the methods of a program. Even so these profilers use sampling, are probabilistic in nature, and may suffer for instance from safepoint bias, they are normally considered to be relatively reliable. However, unreliable or inaccurate profiles may misdirect developers in their quest to resolve performance issues by not correctly identifying the program parts that would benefit most from optimisations.
      
      With the wider adoption of profilers such as async-profiler and Honest Profiler, which are designed to avoid the safepoint bias, we wanted to investigate how precise and accurate Java sampling profilers are today. We investigate the precision, reliability, accuracy, and overhead of async-profiler, Honest Profiler, Java Flight Recorder, JProfiler, perf, and YourKit, which are all actively maintained. We assess them on the fully deterministic Are We Fast Yet benchmarks to have a stable foundation for the probabilistic profilers.
      
      We find that profilers are relatively reliable over 30 runs and normally report the same hottest method. Unfortunately, this is not true for all benchmarks, which suggests their reliability may be application-specific. Different profilers also report different methods as hottest and cannot reliably agree on the set of top 5 hottest methods. On the positive side, the average run time overhead is in the range of 1% to 5.4% for the different profilers.
      
      Future work should investigate how results can become more reliable, perhaps by reducing the observer effect of profilers by using optimisation decisions of unprofiled runs or by developing a principled approach of combining multiple profiles that explore different dynamic optimisations.},
      acceptancerate = {0.54},
      appendix = {https://github.com/HumphreyHCB/AWFY-Profilers},
      author = {Burchell, Humphrey and Larose, Octave and Kaleba, Sophie and Marr, Stefan},
      blog = {https://stefan-marr.de/2023/09/dont-blindly-trust-your-profiler/},
      booktitle = {Proceedings of the 20th ACM SIGPLAN International Conference on Managed Programming Languages and Runtimes},
      doi = {10.1145/3617651.3622985},
      keywords = {CPUSampling Comparison MeMyPublication Precision Profiling myown},
      month = oct,
      pages = {1--14},
      pdf = {https://stefan-marr.de/downloads/mplr23-burchell-et-al-dont-trust-your-profiler.pdf},
      publisher = {ACM},
      series = {MPLR'23},
      title = {{Don’t Trust Your Profiler: An Empirical Study on the Precision and Accuracy of Java Profilers}},
      year = {2023},
      month_numeric = {10}
    }
    
