Python, Is It Being Killed by Incremental Improvements?

Over the past years, two major players have invested in the future of Python. Microsoft’s Faster CPython team pushed ahead with impressive performance improvements for the CPython interpreter, which has gotten at least 2x faster since Python 3.9. They also built a baseline JIT compiler for CPython. At the same time, Meta has worked hard on making free-threaded Python a reality, bringing classic shared-memory multithreading to Python without being limited by the still-standard Global Interpreter Lock, which prevents true parallelism.

Both projects deliver major improvements to Python and the wider ecosystem. So, it’s all great, or is it?

In my talk on this topic at SPLASH, which is now online, I discussed some of the aspects that the Python core developers and the wider community do not seem to regard with the urgency I would hope for. Concurrency makes me scared, and I strongly believe the Python ecosystem should be scared, too, or look forward to the 2030s being “Python’s Decade of Concurrency Bugs”.

In the talk, I start out by reviewing some of the changes in observable language semantics between Python 3.9 and today and discuss their implications. I previously discussed the changes around the global interpreter lock in my post on the changing “guarantees”. In the talk, I also use an example from a real bug report to illustrate the semantic changes:

request_id = self._next_id
self._next_id += 1

It looks simple, but reveals quite profound differences between Python versions.
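
To make this concrete, here is a minimal, self-contained sketch of the same read-then-increment pattern, with hypothetical names not taken from the bug report. Whether the racy variant hands out duplicate ids depends on the interpreter version, the GIL, and timing, which is exactly the kind of semantic difference the talk is about; the lock-based variant is the portable fix.

import threading

class IdAllocator:
    """Hands out supposedly unique, increasing request ids."""

    def __init__(self):
        self._next_id = 0
        self._lock = threading.Lock()

    def next_id_racy(self):
        # A read-modify-write in two steps: another thread can run between
        # the read and the increment and receive the same id.
        request_id = self._next_id
        self._next_id += 1
        return request_id

    def next_id_safe(self):
        # The lock turns the read and the increment into one atomic step.
        with self._lock:
            request_id = self._next_id
            self._next_id += 1
            return request_id

def worker(alloc, out, n=100_000):
    # Each thread records the ids it received; the lists are merged below.
    out.extend(alloc.next_id_racy() for _ in range(n))

if __name__ == "__main__":
    alloc = IdAllocator()
    per_thread = [[] for _ in range(4)]
    threads = [threading.Thread(target=worker, args=(alloc, out))
               for out in per_thread]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    ids = [i for out in per_thread for i in out]
    # Duplicates show up as a set that is smaller than the number of calls.
    print("unique ids:", len(set(ids)), "of", len(ids))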

Since I have some old ideas lying around, I also propose a way forward. In practice, though, this isn’t a small, well-defined engineering or research project. So, I hope I can inspire some of you to follow me down the rabbit hole of Python’s free-threaded future.

Incidentally, the latest release of TruffleRuby now uses many of the techniques that would be useful for Python. Benoit Daloze implemented them during his PhD and we originally published the ideas back in 2018.

Questions, pointers, and suggestions are always welcome, for instance, on Mastodon, BlueSky, or Twitter.

Screen grab of recording, showing title slide and myself at the podium.

Slides

Benchmarking Language Implementations: Am I doing it right? Get Early Feedback!

Modern CPUs, operating systems, and software in general do lots of smart and hard-to-track optimizations, leading to warmup behavior, cache effects, profile pollution, and other unexpected interactions. For us engineers and scientists, whether in industry or academia, this unfortunately means that we may not fully understand the system on top of which we try to measure the performance impact of, for instance, an optimization, a new feature, a data structure, or even a bug fix.

Many of us even treat the hardware and software we run on top of as black boxes, relying on the scientific method to give us a good degree of confidence in our understanding of the performance results we are seeing.

Unfortunately, with the complexity of today’s systems, we can easily miss important confounding variables. Did we account correctly for, e.g., CPU frequency scaling, garbage collection, JIT compilation, and network latency? If not, this can lead us down the wrong, and possibly time-consuming, path of implementing experiments that do not yield the results we are hoping for or that are too specific to allow us to draw general conclusions.
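
To pick just one of these confounders, warmup, here is a tiny sketch (a made-up example, not part of the workshop material) that records per-iteration times. On a JIT-compiling runtime such as PyPy or GraalPy, the first iterations are typically much slower than the later ones, and averaging over all of them mixes compilation work into the steady-state result.

import time

def workload(n=200_000):
    # Arbitrary work that a JIT compiler can optimize once it is warmed up.
    total = 0
    for i in range(n):
        total += i % 7
    return total

def measure(iterations=30):
    # Record every iteration separately instead of a single average,
    # so that warmup behavior remains visible.
    times = []
    for _ in range(iterations):
        start = time.perf_counter()
        workload()
        times.append(time.perf_counter() - start)
    return times

if __name__ == "__main__":
    times = measure()
    print("first 5 iterations:", ["%.2f ms" % (t * 1000) for t in times[:5]])
    print("last 5 iterations: ", ["%.2f ms" % (t * 1000) for t in times[-5:]])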

So, what’s the solution? What could a PhD student or industrial researcher do when planning for the next large project?

How about getting early feedback?

Get Early Feedback at a Language Implementation Workshop!

At the MoreVMs and VMIL workshop series, we introduced a new category of submissions last year: Experimental Setups.

We solicited extended abstracts that focus on the experiments themselves before an implementation is completed. This way, the experimental setup can receive feedback and guidance to improve the chances that the experiments lead to the desired outcomes. With early feedback, we can avoid common traps and pitfalls, share best practices, and deepen our understanding of the systems we are using.

With the complexity of today’s systems, one person, or even one group, is not likely to think of all the issues that may be relevant. Instead of encountering these issues only in the review process after all experiments are done, we can share knowledge and ideas ahead of time, and hopefully improve the science!

So, if you think you may benefit from such feedback, please consider submitting an extended abstract describing your experimental goals and methodology. No results needed!

The next submission deadlines for the MoreVMs’26 workshop are:

  • December 17th, 2025
  • January 12th, 2026

For questions and suggestions, find me on Mastodon, BlueSky, or Twitter, or send me an email!

Can We Know Whether a Profiler is Accurate?

If you have been following the adventures of our hero over the last couple of years, you might remember that we can’t really trust sampling profilers for Java, and it’s even worse for Java’s instrumentation-based profilers.

For sampling profilers, the so-called observer effect gets in the way: when we profile a program, the profiling itself can change the program’s performance behavior. This means we can’t simply increase the sampling frequency to get a more accurate profile, because sampling more often also perturbs the program more. So, how could we possibly know whether a profile correctly reflects an execution?
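
To make the observer effect a bit more tangible, here is a minimal sampling profiler sketched in Python; the post and paper are about JVM profilers, so this is only a conceptual illustration with made-up names. The sampler shares the CPU, caches, and scheduler with the program it observes, and sampling more often means perturbing it more.

import collections
import sys
import threading
import time

def sample_stacks(counts, interval, stop_event):
    # Periodically record the function at the top of every other thread's stack.
    me = threading.get_ident()
    while not stop_event.is_set():
        for tid, frame in sys._current_frames().items():
            if tid != me:
                counts[frame.f_code.co_name] += 1
        # This loop itself costs CPU time, pollutes caches, and competes with
        # the profiled program: the observer effect in its simplest form.
        time.sleep(interval)

def profile(fn, interval=0.001):
    counts = collections.Counter()
    stop = threading.Event()
    sampler = threading.Thread(target=sample_stacks,
                               args=(counts, interval, stop))
    sampler.start()
    try:
        fn()
    finally:
        stop.set()
        sampler.join()
    total = sum(counts.values()) or 1
    return {name: n / total for name, n in counts.items()}

if __name__ == "__main__":
    def busy():
        total = 0
        for i in range(5_000_000):
            total += i * i
        return total

    # Reports the fraction of samples per function, e.g. {'busy': 1.0}.
    print(profile(busy))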

We could try to look at the code and estimate how long each bit takes, and then painstakingly compute what an accurate profile would be. Unfortunately, with the complexity of today’s processors and language runtimes, this would require a cycle-accurate simulator that models everything, from the processor’s pipeline, through the cache hierarchy, to memory and storage. While there are simulators that do this kind of thing, they are generally too slow to simulate a full JVM with JIT compilation for any interesting program within a practical amount of time. This means that simulation is currently impractical, and so is determining what a ground truth would be.

So, what other approaches might there be to determine whether a profile is accurate?

In 2010, Mytkowicz et al. already checked whether Java profilers were actionable by inserting computations at the Java bytecode level. On today’s VMs, that’s unfortunately an approach that changes performance in fairly unpredictable ways, because it interacts with compiler optimizations. However, the idea of checking whether a profiler accurately reflects the slowdown of a program is sound. For example, an inaccurate profiler is less likely to correctly identify a change in the distribution of where a program spends its time. Similarly, if we change the overall amount of time a program takes without changing the distribution of where time is spent, an inaccurate profiler may attribute run time to the wrong parts of the program.

We can detect both of these issues by accurately slowing down a program. And, as you might know from the previous post, we are able to slow down programs fairly accurately. Figure 1 illustrates the idea with a stacked bar chart for a hypothetical distribution of run time over three methods. This distribution should remain identical, independent of the slowdown applied to the program. So, the absolute time measured grows linearly with the slowdown, while the percentage of time per method remains constant.

Figure 1: A stacked bar chart for a hypothetical program execution, showing the absolute time per method. A profiler should see the linear increase in run time taken by each method, but still report the same percentage of run time taken. If a profiler reports something else, we have found an inaccuracy.
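
The arithmetic behind this check is simple enough to sketch; the numbers below are made up, not the Havlak data. A uniform slowdown must leave the reported percentages unchanged, while slowing down a single method shifts them in a precisely predictable way, which is what we compare the profilers against below.

# Hypothetical baseline profile: absolute run time per method, in seconds.
baseline = {"Vector.hasSome": 2.0, "forEach": 1.5, "other": 1.5}

def expected_profile(absolute_times, slowed_method=None, factor=1.0):
    # The profile an accurate profiler should report after `slowed_method`
    # is slowed down by `factor` (factor=1.0 means no change).
    scaled = {m: t * (factor if m == slowed_method else 1.0)
              for m, t in absolute_times.items()}
    total = sum(scaled.values())
    return {m: t / total for m, t in scaled.items()}

# Slowing the whole program down uniformly, here by 3x, must not change
# the percentages; only the absolute times grow linearly.
original = expected_profile(baseline)
uniform = expected_profile({m: 3 * t for m, t in baseline.items()})
assert all(abs(original[m] - uniform[m]) < 1e-12 for m in baseline)

# Slowing down only Vector.hasSome by 2x: it now takes 4.0 of 7.0 seconds,
# so an accurate profiler should attribute about 57% to it instead of 40%.
print(expected_profile(baseline, "Vector.hasSome", factor=2.0))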

With this slowdown approach, we can detect whether the profiler is accurate with respect to the predicted time increase. I’ll leave all the technical details to the paper. We can also slow down individual basic blocks accurately to make a particular method take more time. As it turns out, this is a good litmus test for the accuracy of profilers, and we find a number of examples where they fail to attribute the run time correctly. Figure 2 shows an example for the Havlak benchmark. The bar charts show how much change the four profilers detect after we slowed down Vector.hasSome to the level indicated by the red dashed line. In this particular example, async-profiler detects the change accurately. JFR is probably within the margin of error. However, JProfiler and YourKit are completely off. JProfiler likely can’t deal with inlining and attributes the change to the forEach method that calls hasSome. YourKit does not seem to see the change at all.

Figure 2: Bar chart with the change in run time between the baseline and slowed-down version, for the top 5 methods of the Havlak benchmark. The red dashed line indicates the expected change for the Vector.hasSome method. Only async-profiler and JFR come close to the expectation.

With this slowdown-based approach, we finally have a way to see how accurate sampling profilers are by approximating the ground truth profile. Since we can’t measure the ground truth directly, we sidestep a fundamental problem with a reasonably practical solution.

The paper details how we implement our divining approach, i.e., how we slow down programs accurately. It also has all the methodological details, research questions, benchmarking setup, and lots more numbers, especially in the appendix. So, please give it a read, and let us know what you think.

If you happen to attend the SPLASH conference, Humphrey is presenting our work today and on Saturday.

Questions, pointers, and suggestions are always welcome, for instance, on Mastodon, BlueSky, or Twitter.

Thanks to Octave for feedback on this post.

Update: The recording of the talk is now on YouTube.

Abstract

Optimizing performance on top of modern runtime systems with just-in-time (JIT) compilation is a challenge for a wide range of applications from browser-based applications on mobile devices to large-scale server applications. Developers often rely on sampling-based profilers to understand where their code spends its time. Unfortunately, sampling of JIT-compiled programs can give inaccurate and sometimes unreliable results.

To assess accuracy of such profilers, we would ideally want to compare their results to a known ground truth. With the complexity of today’s software and hardware stacks, such ground truth is unfortunately not available. Instead, we propose a novel technique to approximate a ground truth by accurately slowing down a Java program at the machine-code level, preserving its optimization and compilation decisions as well as its execution behavior on modern CPUs.

Our experiments demonstrate that we can slow down benchmarks by a specific amount, which is a challenge because of the optimizations in modern CPUs, and we verified with hardware profiling that on a basic-block level, the slowdown is accurate for blocks that dominate the execution. With the benchmarks slowed down to specific speeds, we confirmed that async-profiler, JFR, JProfiler, and YourKit maintain original performance behavior and assign the same percentage of run time to methods. Additionally, we identify cases of inaccuracy caused by missing debug information, which prevents the correct identification of the relevant source code. Finally, we tested the accuracy of sampling profilers by approximating the ground truth by the slowing down of specific basic blocks and found large differences in accuracy between the profilers.

We believe our slowdown-based approach is the first practical methodology to assess the accuracy of sampling profilers for JIT-compiling systems and will enable further work to improve the accuracy of profilers.

  • Divining Profiler Accuracy: An Approach to Approximate Profiler Accuracy Through Machine Code-Level Slowdown
    H. Burchell, S. Marr; Proceedings of the ACM on Programming Languages, OOPSLA'25, ACM, 2025.
  • Paper: PDF
  • DOI: 10.1145/3763180
  • Appendix: online appendix
  • BibTeX: bibtex
    @article{Burchell:2025:Divining,
      abstract = {Optimizing performance on top of modern runtime systems with just-in-time (JIT) compilation is a challenge for a wide range of applications from browser-based applications on mobile devices to large-scale server applications. Developers often rely on sampling-based profilers to understand where their code spends its time. Unfortunately, sampling of JIT-compiled programs can give inaccurate and sometimes unreliable results.
      
      To assess accuracy of such profilers, we would ideally want to compare their results to a known ground truth. With the complexity of today's software and hardware stacks, such ground truth is unfortunately not available. Instead, we propose a novel technique to approximate a ground truth by accurately slowing down a Java program at the machine-code level, preserving its optimization and compilation decisions as well as its execution behavior on modern CPUs.
      
      Our experiments demonstrate that we can slow down benchmarks by a specific amount, which is a challenge because of the optimizations in modern CPUs, and we verified with hardware profiling that on a basic-block level, the slowdown is accurate for blocks that dominate the execution. With the benchmarks slowed down to specific speeds, we confirmed that async-profiler, JFR, JProfiler, and YourKit maintain original performance behavior and assign the same percentage of run time to methods. Additionally, we identify cases of inaccuracy caused by missing debug information, which prevents the correct identification of the relevant source code. Finally, we tested the accuracy of sampling profilers by approximating the ground truth by the slowing down of specific basic blocks and found large differences in accuracy between the profilers.
      
      We believe, our slowdown-based approach is the first practical methodology to assess the accuracy of sampling profilers for JIT-compiling systems and will enable further work to improve the accuracy of profilers.},
      acceptancerate = {0.356},
      appendix = {https://doi.org/10.5281/zenodo.16911348},
      articleno = {402},
      author = {Burchell, Humphrey and Marr, Stefan},
      blog = {https://stefan-marr.de/2025/10/can-we-know-whether-a-profiler-is-accurate/},
      doi = {10.1145/3763180},
      issn = {2475-1421},
      journal = {Proceedings of the ACM on Programming Languages},
      keywords = {Accuracy GroundTruth Java MeMyPublication Profiling Sampling myown},
      month = oct,
      number = {OOPSLAB25},
      numpages = {32},
      pdf = {https://stefan-marr.de/downloads/oopsla25-burchell-marr-divining-profiler-accuracy.pdf},
      publisher = {{ACM}},
      series = {OOPSLA'25},
      title = {{Divining Profiler Accuracy: An Approach to Approximate Profiler Accuracy Through Machine Code-Level Slowdown}},
      year = {2025},
      month_numeric = {10}
    }
    
