How to Slow Down a Program? And Why it Can Be Useful.

Most research on programming language performance asks a variation of a single question: how can we make some specific program faster? Sometimes we may even investigate how we can use less memory. This means a lot of research focuses solely on reducing the amount of resources needed to achieve some computational goal.

So, why on earth might we be interested in slowing down programs then?

Slowing Down Programs is Surprisingly Useful!

Making programs slower can be useful to find race conditions, to simulate speedups, and to assess how accurate profilers are.

To detect race conditions, we may want to use an approach similar to fuzzing. Instead of exploring a program’s implementation by varying its input, we can explore different instruction interleavings, thread or event schedules, by slowing down program parts to change timings. This approach allows us to identify concurrency bugs and is used by CHESS, WAFFLE, and NACD.
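To get a concrete, if simplified, picture of the idea (a toy sketch, not how CHESS, WAFFLE, or NACD actually work), consider a harness that injects small random slowdowns into two unsynchronized writers; shifting the timing makes the lost-update race show up more or less often across runs:

import java.util.concurrent.ThreadLocalRandom;

// Toy harness: random slowdowns perturb the interleaving of two
// unsynchronized writers, making the lost-update race easier to hit.
public class RacePerturbation {
    static int counter = 0; // shared and intentionally unsynchronized

    static void randomDelay() {
        // Busy-wait for a small random number of iterations to shift timings.
        int spins = ThreadLocalRandom.current().nextInt(200);
        for (int i = 0; i < spins; i++) {
            Thread.onSpinWait();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable writer = () -> {
            for (int i = 0; i < 100_000; i++) {
                randomDelay();
                counter++; // racy read-modify-write
            }
        };
        Thread t1 = new Thread(writer);
        Thread t2 = new Thread(writer);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        // Without the race, this would always print 200000.
        System.out.println("counter = " + counter);
    }
}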

The Coz profiler is an example of how slowing down programs can be used to simulate speedups. With Coz, we can estimate whether an optimization would be beneficial before implementing it: Coz simulates the speedup by slowing down all other program parts. The part we think might be optimizable stays at the same speed as before, but is now virtually sped up, which allows us to see whether it gives enough of a benefit to justify a perhaps lengthy optimization project.

And, as mentioned before, we can also use it to assess how accurate profilers are. Though, I’ll leave this for the next blog posts. :)

The current approaches to slowing down programs for these use cases are rather coarse-grained though. Race detection often adapts the scheduler or uses APIs such as Thread.sleep(). Similarly, Coz pauses the execution of the other threads. Work on measuring whether profilers give actionable results inserts bytecodes into Java programs to compute Fibonacci numbers as artificial work.
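As a rough, source-level illustration of such Fibonacci-based busy work (the actual approach injects equivalent bytecodes into existing methods rather than calling a helper; the class name and iteration counts below are made up):

// Source-level approximation of Fibonacci busy-work used as artificial slowdown.
final class BusyWork {
    static long fib(int n) {
        long a = 0, b = 1;
        for (int i = 0; i < n; i++) {
            long next = a + b;
            a = b;
            b = next;
        }
        return a;
    }

    // Burn a roughly fixed amount of CPU time; returning the value keeps
    // the JIT from eliminating the computation as dead code.
    static long slowdown(int iterations) {
        long sink = 0;
        for (int i = 0; i < iterations; i++) {
            sink += fib(30);
        }
        return sink;
    }
}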

By using more fine-grained slowdowns, we think we could make race detection, speedup estimation, and profiler accuracy assessments more precise. Thus, we looked into inserting slowdown instructions into basic blocks.

Which x86 Instructions Allow us to Consistently Slow Down Basic Blocks?

Let’s assume we run on some x86 processor and look at programs from the processor’s perspective.

When running a benchmark like Towers, the OpenJDK’s HotSpot JVM may compile it to x86 instructions like this:

mov dword ptr [rsp+0x18], r8d
mov dword ptr [rsp], ecx
mov qword ptr [rsp+0x20], rsi
mov ebx, dword ptr [rsi+0x10]
mov r9d, edx
cmp edx, 0x1
jnz 0x... <Block 55>	

This is one of the basic blocks produced by HotSpot’s C2 compiler. For our purposes, it suffices to see that there are some memory accesses with the mov instructions, and we end up checking whether the edx register contains the value 1. If that’s not the case, we jump to Block 55. Otherwise, execution continues in the next basic block. A key property of a basic block is that there’s no control flow inside of it, which means once it starts executing, all of its instructions will execute.

Though, how can we slow it down?

x86 has many different instructions one could try to insert into the block, each of which will consume CPU cycles. However, modern CPUs try to execute as many instructions as possible at the same time using out-of-order execution. This means instructions in our basic block that do not directly depend on each other might be executed at the same time. For instance, the first three mov instructions access neither the same register nor the same memory location, so the order in which they are executed here does not matter. Though, which optimizations CPUs apply depends on the program and the specific CPU generation, or rather microarchitecture.

To find suitable instructions to slow down basic blocks, we experimented only on an Intel Core i5-10600 CPU, which has the Comet Lake-S microarchitecture. On other microarchitectures, things can be very different.

For the slowdown that we want, we can use nop or mov regX, regX instructions on Comet Lake-S. This mov would move the value from register X to itself, so basically does nothing. These two instructions give us a slowdown that is small enough to slow down most blocks accurately to a desired target speed, and the slowdown seems to affect only the specific block it is meant for.

Our basic block from earlier would then perhaps end up with nop instructions interleaved after each instruction. In practice, the number of instructions we need to insert depends on how much time a basic block takes in the program. Though, for illustration, it might look like this:

mov dword ptr [rsp+0x18], r8d
nop
mov dword ptr [rsp], ecx
nop
mov qword ptr [rsp+0x20], rsi
nop
mov ebx, dword ptr [rsi+0x10]
nop
mov r9d, edx
nop
cmp edx, 0x1
nop
jnz 0x... <Block 55>	
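As noted above, how many filler instructions we need depends on how long the block takes. For a rough back-of-the-envelope sketch of how one might size that number, consider the following; all numbers are invented for illustration, not measurements from our experiments, and real insertion has to be calibrated per block and per microarchitecture:

// Back-of-the-envelope sketch: how many nop-like filler instructions would
// roughly double a block's execution time? The cycle costs are assumptions.
public class FillerEstimate {
    public static void main(String[] args) {
        double blockCycles = 6.0;      // assumed baseline cost of the block
        double cyclesPerFiller = 0.25; // assumed throughput cost of one nop
        double targetOverhead = 1.0;   // +100%, i.e., twice as slow

        long fillers = Math.round(blockCycles * targetOverhead / cyclesPerFiller);
        System.out.println("Insert about " + fillers + " filler instructions");
    }
}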

We tried six different candidates, including a push-pop sequence, to get a better impression of how Comet Lake-S deals with them. For more details of how and what we tried, please have a look at our short paper below, which we will present at the VMIL workshop.

When inserting these instructions into basic blocks, so that each individual basic block takes about twice as much time as before, we end up with a program that indeed is overall twice as slow, as one would hope. Even better, when we look at the Towers benchmark with the async-profiler for HotSpot, and compare the proportions of run time it attributes to each method, the slowed-down and the normal version match almost perfectly, as illustrated below. The same is not true for the other candidates we looked at.

Figure 1: A scatter plot per slowdown instruction with the median run-time percentage for the top six Java methods of Towers. The X=Y diagonal indicates that a method’s run‐time percentage remains the same with and without slowdown.

The paper has a few more details, including a more detailed analysis of the slowdown each candidate introduces, how precise the slowdown is for all basic blocks in the benchmark, and whether it makes a difference when we put the slowdown all at the beginning, interleaved, or at the end.

Of course, this work is merely a stepping stone to more interesting things, which I will look at in a bit more detail in the next post.

Until then, the paper is linked below, and questions, pointers, and suggestions are welcome on Mastodon, BlueSky, or Twitter.

Abstract

Slowing down programs has surprisingly many use cases: it helps finding race conditions, enables speedup estimation, and allows us to assess a profiler’s accuracy. Yet, slowing down a program is complicated because today’s CPUs and runtime systems can optimize execution on the fly, making it challenging to preserve a program’s performance behavior to avoid introducing bias.

We evaluate six x86 instruction candidates for controlled and fine-grained slowdown including NOP, MOV, and PAUSE. We tested each candidate’s ability to achieve an overhead of 100%, to maintain the profiler-observable performance behavior, and whether slowdown placement within basic blocks influences results. On an Intel Core i5-10600, our experiments suggest that only NOP and MOV instructions are suitable. We believe these experiments can guide future research on advanced developer tooling that utilizes fine-granular slowdown at the machine-code level.

  • Evaluating Candidate Instructions for Reliable Program Slowdown at the Compiler Level: Towards Supporting Fine-Grained Slowdown for Advanced Developer Tooling
    H. Burchell, S. Marr; In Proceedings of the 17th ACM SIGPLAN International Workshop on Virtual Machines and Intermediate Languages, VMIL'25, p. 8, ACM, 2025.
  • Paper: PDF
  • DOI: 10.1145/3759548.3763374
  • BibTex: bibtex
    @inproceedings{Burchell:2025:SlowCandidates,
      abstract = {Slowing down programs has surprisingly many use cases: it helps finding race conditions, enables speedup estimation, and allows us to assess a profiler's accuracy. Yet, slowing down a program is complicated because today's CPUs and runtime systems can optimize execution on the fly, making it challenging to preserve a program's performance behavior to avoid introducing bias.
      
      We evaluate six x86 instruction candidates for controlled and fine-grained slowdown including NOP, MOV, and PAUSE. We tested each candidate’s ability to achieve an overhead of 100%, to maintain the profiler-observable performance behavior, and whether slowdown placement within basic blocks influences results. On an Intel Core i5-10600, our experiments suggest that only NOP and MOV instructions are suitable. We believe these experiments can guide future research on advanced developer tooling that utilizes fine-granular slowdown at the machine-code level.},
      author = {Burchell, Humphrey and Marr, Stefan},
      booktitle = {Proceedings of the 17th ACM SIGPLAN International Workshop on Virtual Machines and Intermediate Languages},
      doi = {10.1145/3759548.3763374},
      isbn = {979-8-4007-2164-9/2025/10},
      keywords = {Benchmarking HotSpot ISA Instructions Java MeMyPublication assembly evaluation myown slowdown x86},
      location = {Singapore},
      month = oct,
      pages = {8},
      pdf = {https://stefan-marr.de/downloads/vmil25-burchell-marr-evaluating-candidate-instructions-for-reliable-program-slowdown-at-the-compiler-level.pdf},
      publisher = {{ACM}},
      series = {VMIL'25},
      title = {{Evaluating Candidate Instructions for Reliable Program Slowdown at the Compiler Level: Towards Supporting Fine-Grained Slowdown for Advanced Developer Tooling}},
      year = {2025},
      month_numeric = {10}
    }
    

It's Thursday, and My Last* Day at Kent

Today is the 31st of July 2025, and from tomorrow on I’ll be “between jobs”, or as Gen Z allegedly calls it, on a micro-retirement.

When I first came to Kent for my interview, I was thinking I’d do this one for practice. I still had more than two years left on a research grant we had just got, which promised to be lots of fun, but academic jobs for PL systems people are rare, and even rarer these days. But then I got the call from Richard Jones, offering me the position, and I never regretted taking him up on it.

Kent’s School of Computing was just growing its Programming Languages and Systems (PLAS) group, and Richard, Simon Thompson, Andy King, Peter Rodgers, and many others at the School did a remarkable job in creating an environment and community that was truly supportive of young academics taking their first steps in a permanent academic post, be it wrestling with teaching duties, papers, reviews, reviewers, and of course grant writing. PLAS and the School of Computing were the right place for me.

Of course, many things have changed since my start in October 2017. Perhaps most notably, Computing is now in the Kennedy building, a very nice space. But there was also that moment when we, the young ones, became the “senior” ones. Mark, Laura, and Dominic grew well into their new roles, and I can only hope that I passed on some of the extensive support I received to the people who started after me.

There are many challenges ahead for my dear colleagues at Kent, but I hope that enough of the spirit of support and community remains in the School, enabling PLAS and the next generation of academics to do great things.

Also a huge thank you to Kemi, Anna, and Janet for keeping the School afloat.

I’ll miss you all. Thanks for everything! And see you soon!

Most of PLAS in October 2023

* It’s a little more complicated than that, but for good reasons. Right, EPSRC? :)

Instrumentation-based Profiling on JVMs is Broken!

Last year, we looked at how well sampling profilers work on top of the JVM. Unfortunately, they suffer from issues such as safepoint bias and may not attribute observed run time to the correct methods because of the complexities introduced by inlining and other compiler optimizations.

After looking at sampling profilers, Humphrey started to investigate instrumentation-based profilers and found during his initial investigation that they were giving much more consistent numbers. Unfortunately, it quickly became clear that the state-of-the-art instrumentation-based profilers on the JVM also have major issues, which result in profiles that are not representative of production performance. Since profilers are supposed to help us identify performance issues, they fail at their one job.

When investigating them further, we found that they interact badly with inlining and other standard optimizations. Because the profilers we found instrument JVM bytecodes, they add a lot of extra code that compiler optimizations treat like any other application code. While this does not strictly prevent optimizations such as inlining, the extra code interferes with them enough that the observable behavior of a program with and without inlining is basically identical. In practice, this means that bytecode-level instrumentation may be easily portable across JVMs, but such profilers can’t effectively guide developers to the code that would benefit most from attention, which is their main purpose.
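To get a feel for why this interferes with optimizations, here is a source-level caricature of what bytecode-level instrumentation effectively turns a small method into; real profilers rewrite bytecode directly and their probes differ, so the names below are placeholders:

import java.util.concurrent.atomic.AtomicLong;

// Source-level caricature of early (bytecode-level) instrumentation: every
// method entry and exit gets wrapped in probe calls, so by the time the JIT
// sees the code, the actual work is buried in bookkeeping.
final class Probes {
    static final AtomicLong[] invocations = { new AtomicLong(), new AtomicLong() };

    static void enter(int methodId) { invocations[methodId].incrementAndGet(); }
    static void exit(int methodId)  { /* e.g., record elapsed time here */ }
}

class Example {
    // Originally just: int add(int a, int b) { return a + b; }
    int add(int a, int b) {
        Probes.enter(0);
        try {
            return a + b;       // the single operation we actually care about
        } finally {
            Probes.exit(0);     // extra calls, atomic updates, and exception
        }                       // edges now surround that one operation
    }
}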

Profilers that do not capture production performance will misguide us!

While they can still identify the code that is activated most often, the interaction with optimizations means that developers see mostly unoptimized behavior. With today’s highly optimizing compilers this is unfortunate, because we may end up optimizing code that the compiler normally would have optimized for us already, and we spend time on things that likely won’t make a difference in production.

Let’s look at an example from our paper:

class ActionA { int id; void execute() {} }
class ActionB { int id; void execute() {} }
var actions = getMixOfManyActions();
bubbleSortById(actions);
framework.execute(actions);

In this admittedly a little contrived example, we use some kind of framework, for which we have actions that the framework applies for us. This is probably a worst case for profilers that instrument bytecodes. Here, the execute() methods would be identified as the most problematic aspect, though they don’t do anything. A just-in-time compiler like HotSpot’s C2 would likely end up seeing a bimorphic call site to execute() and inline both methods. And if the compiler heuristics are with us, it might even optimize out the empty loop in the framework.

So, if we assume a sufficiently smart compiler, our inefficient code, forced on us by the framework, is taken care of by the compiler. A good profiler would ideally guide us to bubbleSortById() as being of interest. Typically, we’d expect to get a good speedup here by switching to a more suitable sorting algorithm, especially since we implicitly assume there are many actions, so that this code matters in production.
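For context, the framework’s loop might look roughly like the following. This is a hypothetical sketch, not code from the paper, and it adds a common Action interface that the snippet above elides; it shows the bimorphic call site that C2 can inline and, with both bodies empty, potentially reduce to a loop it removes entirely:

import java.util.List;

// Hypothetical framework internals: actions holds a mix of ActionA and
// ActionB, so the a.execute() call site is bimorphic. After inlining both
// empty bodies, the JIT may remove the loop altogether -- yet a bytecode-level
// profiler would still report execute() as the hot spot.
interface Action { void execute(); }

class Framework {
    void execute(List<Action> actions) {
        for (Action a : actions) {
            a.execute();
        }
    }
}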

To me this means that instrumentation-based profilers can only be a last resort for when sampling, with its own flaws, fails. They are just not useful enough as they are.

Can we do better than profilers that instrument bytecode?

At the time, Humphrey was quite in favor of instrumentation, because it gives very consistent results. So, he wanted to make the results of instrumentation-based profilers more realistic. Inspired by the work of Basso et al., he built an instrumentation-based profiler into the Graal just-in-time compiler that works more like classic instrumentation-based profilers for ahead-of-time-compiled language implementations.

The basic idea is illustrated below:

Figure 1: Instrumentation-based profilers on the JVM typically insert instrumentation very early, before compilers optimize code. In our profiler, instrumentation is inserted very late, to minimize interfering with optimizations.

Instead of inserting the instrumentation right when the bytecode is loaded, for instance with an agent or some other form of bytecode rewriting, we move the addition of instrumentation code to a much later part of the just-in-time compilation. Most importantly, we insert it only after inlining and most optimizations have been performed. To keep the prototype simple, we insert the probes right before the high-level IR is turned into the low-level IR. At this point, there are still a few optimizations to be performed, including instruction selection and register allocation. Though, in the grand scheme of things, these are minor.
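Conceptually, the probe that is inserted late is tiny. The sketch below is not Graal’s API, just a plain-Java picture of what each compiled method (or inlined region) ends up doing at run time under our assumed design:

// Plain-Java picture of a late-inserted probe (hypothetical design, not
// Graal's API): each instrumented region gets a slot in a counter array, and
// the probe compiled into it bumps that slot. Because the probe is added
// after inlining, counts can still be attributed to the methods that were
// inlined into the compilation unit.
final class LateProbes {
    static long[] blockCounters;          // one slot per instrumented region

    static void allocate(int regions) {
        blockCounters = new long[regions];
    }

    // What the emitted probe amounts to: a single increment, no allocation,
    // no synchronization, so it barely perturbs the optimized code around it.
    static void hit(int regionId) {
        blockCounters[regionId]++;
    }
}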

How much better is it?

With his prototype, Humphrey managed to achieve not only much better performance than classic instrumentation-based profilers, but also minimize interference with optimizations. For a rough idea of the overall performance impact of this approach, let’s have a look at Figure 2:

Figure 2: Sampling-based profilers such as Async, Honest, JFR, Perf, and YourKit (in sampling mode) have very low overhead, though they suffer from safepoint bias and only observe samples. YourKit and JProfiler in instrumentation mode introduce overhead of two orders of magnitude and lead to unrealistic results because of their impact on optimizations. Bubo, our prototype, has much lower overhead and does not interfere with optimizations.

With a few extra tricks briefly sketched in the paper, we get good attribution of where time is spent, even in the presence of inlining, reduce overhead, and benefit from the more precise results of instrumentation, because it does not share sampling’s drawback of only occasionally obtaining data.

There’s one major open question though: what does a correct profile look like? At the moment, we can’t assess whether our approach is correct. Sampling profilers, as we saw last year, also do not agree on a single answer. So, while we believe our approach is much better than classic instrumentation, we still need to find out how correct it is.

All results so far, and a few more technical details are in the paper linked below. Questions, pointers, and suggestions are greatly appreciated perhaps on Mastodon or Twitter @smarr.

Abstract

Profilers are crucial tools for identifying and improving application performance. However, for language implementations with just-in-time (JIT) compilation, e.g., for Java and JavaScript, instrumentation-based profilers can have significant overheads and report unrealistic results caused by the instrumentation.

In this paper, we examine state-of-the-art instrumentation-based profilers for Java to determine the realism of their results. We assess their overhead, the effect on compilation time, and the generated bytecode. We found that the profiler with the lowest overhead increased run time by 82x. Additionally, we investigate the realism of results by testing a profiler’s ability to detect whether inlining is enabled, which is an important compiler optimization. Our results document that instrumentation can alter program behavior so that performance observations are unrealistic, i.e., they do not reflect the performance of the uninstrumented program.

As a solution, we sketch late-compiler-phase-based instrumentation for just-in-time compilers, which gives us the precision of instrumentation-based profiling with an overhead that is multiple magnitudes lower than that of standard instrumentation-based profilers, with a median overhead of 23.3% (min. 1.4%, max. 464%). By inserting probes late in the compilation process, we avoid interfering with compiler optimizations, which yields more realistic results.

  • Towards Realistic Results for Instrumentation-Based Profilers for JIT-Compiled Systems
    H. Burchell, O. Larose, S. Marr; In Proceedings of the 21st ACM SIGPLAN International Conference on Managed Programming Languages and Runtimes, MPLR'24, ACM, 2024.
  • Paper: PDF
  • DOI: 10.1145/3679007.3685058
  • BibTex: bibtex
    @inproceedings{Burchell:2024:InstBased,
      abstract = {Profilers are crucial tools for identifying and improving application performance. However, for language implementations with just-in-time (JIT) compilation, e.g., for Java and JavaScript, instrumentation-based profilers can have significant overheads and report unrealistic results caused by the instrumentation.
      
      In this paper, we examine state-of-the-art instrumentation-based profilers for Java to determine the realism of their results. We assess their overhead, the effect on compilation time, and the generated bytecode. We found that the profiler with the lowest overhead increased run time by 82x. Additionally, we investigate the realism of results by testing a profiler’s ability to detect whether inlining is enabled, which is an important compiler optimization. Our results document that instrumentation can alter program behavior so that performance observations are unrealistic, i.e., they do not reflect the performance of the uninstrumented program.
      
      As a solution, we sketch late-compiler-phase-based instrumentation for just-in-time compilers, which gives us the precision of instrumentation-based profiling with an overhead that is multiple magnitudes lower than that of standard instrumentation-based profilers, with a median overhead of 23.3% (min. 1.4%, max. 464%). By inserting probes late in the compilation process, we avoid interfering with compiler optimizations, which yields more realistic results.},
      author = {Burchell, Humphrey and Larose, Octave and Marr, Stefan},
      blog = {https://stefan-marr.de/2024/09/instrumenation-based-profiling-on-jvms-is-broken/},
      booktitle = {Proceedings of the 21st ACM SIGPLAN International Conference on Managed Programming Languages and Runtimes},
      doi = {10.1145/3679007.3685058},
      keywords = {Graal Instrumentation JVM Java MeMyPublication Optimization Profiler Profiling Sampling myown},
      month = sep,
      pdf = {https://stefan-marr.de/downloads/mplr24-burchell-et-al-towards-realistic-results-for-instrumentation-based-profilers-for-jit-compiled-systems.pdf},
      publisher = {ACM},
      series = {MPLR'24},
      title = {{Towards Realistic Results for Instrumentation-Based Profilers for JIT-Compiled Systems}},
      year = {2024},
      month_numeric = {9}
    }
    
