<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://stefan-marr.de/feed/index.xml" rel="self" type="application/atom+xml" /><link href="https://stefan-marr.de/" rel="alternate" type="text/html" /><updated>2026-04-03T18:54:03+02:00</updated><id>https://stefan-marr.de/feed/index.xml</id><title type="html">Stefan-Marr.de</title><subtitle>personal and research notes
</subtitle><entry><title type="html">Programming Language Implementation: In Theory, We Understand. In Practice, We Wish We Would.</title><link href="https://stefan-marr.de/2026/02/programming-language-implementation-in-theory-we-understand-in-practice-we-wish-we-would/" rel="alternate" type="text/html" title="Programming Language Implementation: In Theory, We Understand. In Practice, We Wish We Would." /><published>2026-02-02T15:15:10+01:00</published><updated>2026-02-02T15:15:10+01:00</updated><id>https://stefan-marr.de/2026/02/programming-language-implementation-in-theory-we-understand-in-practice-we-wish-we-would</id><content type="html" xml:base="https://stefan-marr.de/2026/02/programming-language-implementation-in-theory-we-understand-in-practice-we-wish-we-would/"><![CDATA[<p>It’s February! This means I have been <a href="https://stefan-marr.de/2025/10/first-day-at-jku/">at the JKU</a> for four months.
Four months with teaching <a href="https://ssw.jku.at/Teaching/Lectures/CB/VL/">Compiler Construction</a> and <a href="https://ssw.jku.at/Teaching/Lectures/SSW/">System Software</a>,
lots of new responsibilities (most notably signing off on telephone bills and coffee orders…), many new colleagues, and new things to learn for me, not least because of the very motivated students and PhD students here.
And when I say motivated, I mean it: I was genuinely surprised. While attendance at my 8:30am Compiler Construction lectures declined throughout the term, as expected, the students absolutely aced their exam.
I suspect I will have to make it harder next year. Much harder… hmmm 🤔
Much of this success can likely be attributed to the very extensive exercise sessions run by my colleagues throughout the semester.</p>

<p>At this point, I have to send a big <em>thank you</em> to everyone from the <a href="https://ssw.jku.at/General/Staff/">Institute for System Software</a>, past and present.
It’s great to be part of such a team! You made my start very easy, and, well, it now gives me the time to think about my inaugural lecture.</p>

<h2 id="whats-an-inaugural-lecture">What’s an inaugural lecture?</h2>

<p>I have been in academia for almost two decades, but I have to admit, I don’t really remember being at an inaugural lecture.
According to Wikipedia, in the Germanic tradition an <a href="https://de.wikipedia.org/wiki/Antrittsvorlesung">inaugural lecture (Antrittsvorlesung)</a> is these days something of a celebration.
It’s a festive occasion for a new professor to present their field to a wider audience, possibly also presenting their research vision.</p>

<p>At the JKU, it indeed seems to be planned as a festive occasion, too.</p>

<p>On March 9th, 2026, starting at 4pm, Prof. Bernhard Aichernig and I will give our <em>Antrittsvorlesungen</em>,
and you are <a href="https://www.jku.at/fileadmin/gruppen/90/Downloads/AVO_Aichernig_Marr/2026-03-09_Einladung_AVOL_Aichernig_Marr.pdf">cordially invited to attend</a>.</p>

<p>Bernhard will give a talk titled <a href="https://www.jku.at/fileadmin/gruppen/90/Downloads/AVO_Aichernig_Marr/Infoblatt_AVO_Aichernig_en.pdf">Verification, Falsification, and Learning – a Triptych of Formal Methods for Trustworthy IT Systems</a>.</p>

<p>My own talk is titled, as is this post: <a href="https://www.jku.at/fileadmin/gruppen/90/Downloads/AVO_Aichernig_Marr/Infoblatt_AVO_Marr_en.pdf">Programming Language Implementation: In Theory, We Understand. In Practice, We Wish We Would</a>.</p>

<p>Bernhard will start out by looking at the formal side of things, making the connection between proving correctness, testing systems in the context of where they are used, and learning models from observable data. My talk will narrow in on language implementations, but also look at how formal correctness is helping us there.
Unfortunately, provably-correct systems still elude us for many practical languages.
Even worse, we are at a point where we rarely understand what’s going on in enough detail to improve performance or perhaps fix certain rare bugs.</p>

<p>If you would like to attend, <a href="http://www.jku.at/vas">please register here</a>.</p>

<h2 id="in-theory-we-understand-in-practice-we-wish-we-would">In Theory, We Understand. In Practice, We Wish We Would</h2>

<p>Here’s the abstract of my talk:</p>

<blockquote>
  <p>Our world runs on software, but we understand it less and less. In practice, the complexity of modern
systems drains your phone’s battery faster, increases the cost of hosting applications, and consumes
unnecessary resources, for instance, in AI systems. All because we do not truly understand our
systems any longer. Still, at a basic level, we can fully understand how computers work, from
transistors to processors, machine language, all the way up to high-level programming languages.</p>

  <p>The convenience of contemporary programming languages is however bought with complexity. Over
the last two decades, I admit, I added to that complexity. In the next two decades, I hope we can learn
to build programming languages in ways that we can prove to be correct, enable us to generate their
implementations automatically, and let systems select optimizations in a way that we can still
understand the implications for software running on top of it.</p>
</blockquote>

<p>You may now wonder where to go from here. And that’s a very good question.
I have another month to figure that out, perhaps more… 😅</p>

<p>So, maybe see you in March?</p>

<p>Until then, suggestions, questions, and complaints, as usual on 
<a href="https://mastodon.acm.org/@smarr/115297555824308876">Mastodon</a>,
<a href="https://bsky.app/profile/stefan-marr.de/post/3m24gusnpnk2l">BlueSky</a>, and
<a href="https://x.com/smarr/status/1973277801689260347">Twitter</a>.</p>]]></content><author><name></name></author><category term="Personal" /><category term="Personal" /><category term="Research" /><category term="Teaching" /><category term="Linz" /><summary type="html"><![CDATA[It’s February! This means I have been at the JKU for four months. Four months with teaching Compiler Construction and System Software, lots of new responsibilities (most notably signing off on telephone bills and coffee orders…), many new colleagues, and new things to learn for me, not least because of the very motivated students and PhD students here. And when I say motivated, yes, I am very surprised. While the attendance of my 8:30am Compiler Construction lectures was declining throughout the term as expected, the students absolutely aced their exam. I suspect I will have to make it harder next year. Much harder… hmmm 🤔 Much of the good results can likely be attributed to the very extensive exercise sessions run by my colleagues throughout the semester.]]></summary></entry><entry><title type="html">Python, Is It Being Killed by Incremental Improvements?</title><link href="https://stefan-marr.de/2026/01/python-killed-by-incremental-improvements-questionmark/" rel="alternate" type="text/html" title="Python, Is It Being Killed by Incremental Improvements?" /><published>2026-01-20T14:02:15+01:00</published><updated>2026-01-20T14:02:15+01:00</updated><id>https://stefan-marr.de/2026/01/python-killed-by-incremental-improvements-questionmark</id><content type="html" xml:base="https://stefan-marr.de/2026/01/python-killed-by-incremental-improvements-questionmark/"><![CDATA[<p>Over the past years, two major players invested into the future of Python. Microsoft’s Faster CPython team has pushed ahead with impressive performance improvements for the CPython interpreter, which has gotten at least 2x faster since Python 3.9. They also have a baseline JIT compiler for CPython, too. 
At the same time, Meta has worked hard on making free-threaded Python a reality, bringing classic shared-memory multithreading to Python without the limits of the still-standard Global Interpreter Lock, which prevents true parallelism.</p>

<p>Both projects deliver major improvements to Python and the wider ecosystem. So, it’s all great, or is it?</p>

<p>In <a href="https://youtu.be/03DswsNUBdQ">my talk on this topic at SPLASH, which is now online</a>, I discussed some of the aspects the Python core developers and wider community seem not to regard with the urgency I would hope for. Concurrency makes me scared, and I strongly believe the Python ecosystem should be scared, too, or look forward to the 2030s being “Python’s Decade of Concurrency Bugs”.</p>

<p>In the talk, I start out by reviewing some of the changes in observable language semantics between Python 3.9 and today and discussing their implications.
I previously discussed the changes around the <em>global interpreter lock</em> in my post on the <a href="https://stefan-marr.de/2023/11/python-global-interpreter-lock/">changing “guarantees”</a>. I also use an example from a real bug report to illustrate the semantic changes:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>request_id = self._next_id
self._next_id += 1
</code></pre></div></div>

<p>It looks simple, but reveals quite profound differences between Python versions.</p>
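<p>To make the race concrete, here is a minimal, hypothetical sketch (the class and method names are mine, not from the bug report). The read and the increment are two separate operations, so without additional synchronization two threads can observe the same value of <code>_next_id</code>; a lock is one way to restore the intended atomicity.</p>

```python
import threading

class IdAllocator:
    def __init__(self):
        self._next_id = 0
        self._lock = threading.Lock()

    def next_id_racy(self):
        # Mirrors the snippet above: a read followed by a separate write.
        # Between the two, another thread may run and hand out the same
        # id; whether and how often this happens depends on the Python
        # version and build.
        request_id = self._next_id
        self._next_id += 1
        return request_id

    def next_id_safe(self):
        # Guarding the read-modify-write with a lock makes the pair
        # atomic on any Python version, with or without the GIL.
        with self._lock:
            request_id = self._next_id
            self._next_id += 1
            return request_id

# With the lock, four threads drawing 1000 ids each always
# produce 4000 distinct ids.
alloc = IdAllocator()
ids = []
collect = threading.Lock()

def worker():
    local = [alloc.next_id_safe() for _ in range(1000)]
    with collect:
        ids.extend(local)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(set(ids)))  # → 4000
```

<p>With <code>next_id_racy</code>, the same experiment may or may not lose ids, and that difference between versions and builds is exactly the kind of semantic change the talk discusses.</p>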

<p>Since I have some old ideas lying around, I also propose a way forward.
In practice though, this isn’t a small well-defined engineering or research project.
So, I hope I can inspire some of you to follow me down the rabbit hole of Python’s free-threaded future.</p>

<p>Incidentally, the <a href="https://truffleruby.dev/blog/truffleruby-33-is-released">latest release of TruffleRuby</a> now uses many of the techniques that would be useful for Python. <a href="https://eregon.me/">Benoit Daloze</a> implemented them during his PhD and we originally <a href="https://stefan-marr.de/downloads/oopsla18-daloze-et-al-parallelization-of-dynamic-languages-synchronizing-built-in-collections.pdf">published the ideas back in 2018</a>.</p>

<p>Questions, pointers, and suggestions are always welcome, for instance, on
<a href="https://mastodon.acm.org/@smarr/115927813942871408">Mastodon</a>,
<a href="https://bsky.app/profile/stefan-marr.de/post/3mcudj6o42k24">BlueSky</a>, or
<a href="https://x.com/smarr/status/2013615257567023329">Twitter</a>.</p>

<p><a href="https://youtu.be/03DswsNUBdQ"><img src="/assets/2026/01/python-killed-by-incremental-improvements.jpg" alt="Screen grab of recording, showing title slide and myself at the podium." style="width: 100%; border: 1px solid black;" /></a></p>

<h3 id="slides">Slides</h3>

<iframe src="https://1drv.ms/p/c/30da3b1ead53408b/IQSrY83HcUNEToYQr3sV76i1AQwQ0IKantC0U-qsl0LHhwE" width="592" height="481" frameborder="0" scrolling="no"></iframe>]]></content><author><name></name></author><category term="Research" /><category term="Python" /><category term="Research" /><category term="Concurrency" /><category term="Concurrency Models" /><category term="CPython" /><category term="Interpreters" /><category term="Language Implementation" /><category term="Dynamic Languages" /><category term="Language Design" /><category term="Parallelism" /><category term="Presentation" /><summary type="html"><![CDATA[Over the past years, two major players invested into the future of Python. Microsoft’s Faster CPython team has pushed ahead with impressive performance improvements for the CPython interpreter, which has gotten at least 2x faster since Python 3.9. They also have a baseline JIT compiler for CPython, too. At the same time, Meta is worked hard on making free-threaded Python a reality to bring classic shared-memory multithreading to Python, without being limited by the still standard Global Interpreter Lock, which prevents true parallelism.]]></summary></entry><entry><title type="html">Benchmarking Language Implementations: Am I doing it right? Get Early Feedback!</title><link href="https://stefan-marr.de/2025/11/experimental-setups/" rel="alternate" type="text/html" title="Benchmarking Language Implementations: Am I doing it right? Get Early Feedback!" /><published>2025-11-17T19:00:24+01:00</published><updated>2025-11-17T19:00:24+01:00</updated><id>https://stefan-marr.de/2025/11/experimental-setups</id><content type="html" xml:base="https://stefan-marr.de/2025/11/experimental-setups/"><![CDATA[<p>Modern CPUs, operating systems, and software in general do lots of smart and hard-to-track optimizations, leading to warmup behavior, cache effects, profile pollution and other unexpected interactions.
For us engineers and scientists, whether in industry or academia, this unfortunately means that we may not fully understand the system underneath us when trying to measure the performance impact of, for instance, an optimization, a new feature, a data structure, or even a bug fix.</p>

<p>Many of us even treat the hardware and software we run on top of as black boxes, relying on the <em>scientific method</em> to give us a good degree of confidence in the understanding of the performance results we are seeing.</p>

<p>Unfortunately, with the complexity of today’s systems, we can easily miss important confounding variables.
Did we account, e.g., for CPU frequency scaling, garbage collection, JIT compilation, and network latency correctly?
If not, we can be led down a wrong and possibly time-consuming path: we implement experiments that do not yield the results we are hoping for, or experiments that are too specific to allow us to draw general conclusions.</p>
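<p>Even a tiny harness can make some of these effects visible. The sketch below is my own illustration, not a complete methodology: it records per-iteration times so that warmup behavior is not averaged away.</p>

```python
import statistics
import time

def measure(workload, iterations=30, warmup=10):
    # Record every iteration separately; a single mean over all
    # iterations would hide warmup effects, e.g., from JIT
    # compilation or caches filling up.
    times = []
    for _ in range(iterations):
        start = time.perf_counter()
        workload()
        times.append(time.perf_counter() - start)
    # Report warmup and steady-state iterations separately.
    return times[:warmup], times[warmup:]

warm, steady = measure(lambda: sum(i * i for i in range(50_000)))
print(f"median warmup: {statistics.median(warm):.6f}s")
print(f"median steady: {statistics.median(steady):.6f}s")
```

<p>Whether ten iterations are enough to reach a steady state is itself an open question for any given system, and exactly the kind of question early feedback on an experimental setup can help answer.</p>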

<p>So, what’s the solution? What could a PhD student or industrial researcher do when planning for the next large project?</p>

<p>How about getting early feedback?</p>

<h3 id="get-early-feedback-at-a-language-implementation-workshop">Get Early Feedback at a Language Implementation Workshop!</h3>

<p>At the <a href="https://2025.programming-conference.org/home/MoreVMs-2025#exp-setup">MoreVMs</a> and <a href="https://conf.researchr.org/home/icfp-splash-2025/vmil-2025#Call-for-Papers">VMIL</a> workshop series, we introduced a new category of submissions last year:
<em>Experimental Setups</em>.</p>

<p>We solicited extended abstracts that focus on the experiments themselves before an implementation is completed.
This way, the experimental setup can receive feedback and guidance to improve the chances that the experiments lead to the desired outcomes. With early feedback, we can avoid common traps and pitfalls, share best practices, and develop a deeper understanding of the systems we are using.</p>

<p>With the complexity of today’s systems, one person, or even one group, is not likely to think of all the issues that may be relevant. Instead of encountering these issues only in the review process after all experiments are done, we can share knowledge and ideas ahead of time, and hopefully <em>improve the science</em>!</p>

<p>So, if you think you may benefit from such feedback, please consider submitting an extended abstract describing your experimental goals and methodology. No results needed!</p>

<p>The next submission deadlines for the <a href="https://2026.programming-conference.org/home/MoreVMs-2026">MoreVMs’26 workshop</a> are:</p>
<ul>
  <li>December 17th, 2025</li>
  <li>January 12th, 2026</li>
</ul>

<p>For questions and suggestions, find me on
<a href="https://mastodon.acm.org/@smarr/115571485445944651">Mastodon</a>,
<a href="https://bsky.app/profile/stefan-marr.de/post/3m5w3ouay222f">BlueSky</a>, or
<a href="https://x.com/smarr/status/1990808954067177885">Twitter</a>, or send me an <a href="https://ssw.jku.at/General/Staff/Marr/">email</a>!</p>]]></content><author><name></name></author><category term="Research" /><category term="Benchmarking" /><category term="Workshops" /><category term="Publications" /><category term="Papers" /><category term="Science" /><category term="Methodology" /><summary type="html"><![CDATA[Modern CPUs, operating systems, and software in general do lots of smart and hard-to-track optimizations, leading to warmup behavior, cache effects, profile pollution and other unexpected interactions. For us engineers and scientists, whether in industry or academia, this unfortunately means that we may not fully understand the system on top of which we are trying to measure the performance impact of, for instance, an optimization, a new feature, a data structure, or even a bug fix.]]></summary></entry><entry><title type="html">Can We Know Whether a Profiler is Accurate?</title><link href="https://stefan-marr.de/2025/10/can-we-know-whether-a-profiler-is-accurate/" rel="alternate" type="text/html" title="Can We Know Whether a Profiler is Accurate?" /><published>2025-10-15T02:26:28+02:00</published><updated>2025-10-15T02:26:28+02:00</updated><id>https://stefan-marr.de/2025/10/can-we-know-whether-a-profiler-is-accurate</id><content type="html" xml:base="https://stefan-marr.de/2025/10/can-we-know-whether-a-profiler-is-accurate/"><![CDATA[<p>If you have been following the adventures of our <a href="https://github.com/HumphreyHCB">hero</a> over the last couple of years,
you might remember that we <a href="https://stefan-marr.de/2023/09/dont-blindly-trust-your-profiler/">can’t really trust sampling profilers for Java</a>,
and <a href="https://stefan-marr.de/2024/09/instrumenation-based-profiling-on-jvms-is-broken/">it’s even worse for Java’s instrumentation-based profilers</a>.</p>

<p>For sampling profilers, the so-called <em>observer effect</em> gets in the way: when we profile a program, the profiling itself can change the program’s performance behavior. This means we can’t simply increase the sampling frequency to get a more accurate profile, because the sampling causes inaccuracies.
So, how could we possibly know whether a profile correctly reflects an execution?</p>

<p>We could try to look at the code and estimate how long each bit takes, and then painstakingly compute what an accurate profile would be. Unfortunately, with the complexity of today’s processors and language runtimes, this would require a cycle-accurate simulator that models everything, from the processor’s pipeline through the cache hierarchy to memory and storage.
While there are simulators that do this kind of thing, they are generally too slow to simulate a full JVM with JIT compilation for any interesting program within a practical amount of time.
This means that simulation is currently impractical, and so is determining what a <em>ground truth</em> would be.</p>

<p>So, what other approaches might there be to determine whether a profile is accurate?</p>

<p>In 2010, <a href="https://dl.acm.org/doi/10.1145/1806596.1806618">Mytkowicz et al.</a> already checked whether Java profilers were <em>actionable</em>
by inserting computations at the Java bytecode level.
On today’s VMs, that’s unfortunately an approach that changes performance in fairly unpredictable ways, because it interacts with the compiler optimizations.
However, the idea to check whether a profiler accurately reflects the slowdown of a program is sound.
For example, an inaccurate profiler is less likely to correctly identify a change in the distribution of where a program spends its time.
Similarly, if we change the overall amount of time a program takes without changing the distribution of where time is spent,
an inaccurate profiler may attribute run time to the wrong parts of a program.</p>

<p>We can detect both of these issues by accurately slowing down a program.
And, as you might know from the <a href="/2025/08/how-to-slow-down-a-program/">previous post</a>,
we are able to slow down programs fairly accurately.
<a href="#fig1">Figure 1</a> illustrates the idea with a stacked bar chart for a hypothetical distribution of run time over three methods. This distribution should remain identical, independent of any slowdown applied to the program.
Thus, the absolute time measured per method scales linearly with the slowdown, while the percentage of time per method remains constant.</p>

<figure id="fig1">
<img src="/assets/2025/10/sketch-of-ideal-slowdown.svg" />
<figcaption><strong>Figure 1:</strong> A stacked bar chart for a hypothetical program execution, showing the absolute time per method. A profiler should see the linear increase in run time taken by each method, but still report the same percentage of run time taken. If a profiler reports something else, we have found an inaccuracy.</figcaption>
</figure>
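<p>The invariant from Figure 1 can be stated in a few lines of code. The per-method numbers below are made up for illustration; only the relationship matters.</p>

```python
# Hypothetical per-method times of a program, in milliseconds.
baseline = {"methodA": 40.0, "methodB": 35.0, "methodC": 25.0}

def slow_down(profile, factor):
    # A uniform slowdown scales every method's absolute time linearly.
    return {m: t * factor for m, t in profile.items()}

def percentages(profile):
    # The shares a profiler should report, in percent.
    total = sum(profile.values())
    return {m: round(100.0 * t / total, 1) for m, t in profile.items()}

slowed = slow_down(baseline, 2.0)

# Absolute times double, but the reported shares stay identical;
# a profiler reporting different shares has an accuracy problem.
assert slowed["methodA"] == 80.0
assert percentages(slowed) == percentages(baseline)
```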

<p>With this slowdown approach, we can detect whether the profiler is accurate with respect to the predicted time increase.
I’ll leave all the technical details to the <a href="#paper">paper</a>.
We can also slow down individual basic blocks accurately to make a particular method take more time.
As it turns out, this is a good litmus test for the accuracy of profilers,
and we find a number of examples where they fail to attribute the run time correctly.
<a href="#fig2">Figure 2</a> shows an example for the <a href="https://github.com/smarr/are-we-fast-yet/tree/master/benchmarks/Java/src/havlak">Havlak benchmark</a>.
The bar charts show how much change the four profilers detect after we slowed down
<code>Vector.hasSome</code> to the level indicated by the red dashed line.
In this particular example, async-profiler detects the change accurately.
JFR is probably within the margin of error.
However, JProfiler and YourKit are completely off. JProfiler likely can’t deal with inlining and attributes the change to the <code>forEach</code> method that calls <code>hasSome</code>.
YourKit does not seem to see the change at all.</p>

<figure id="fig2">
<img src="/assets/2025/10/havlak-slowdown-of-hassome.svg" />
<figcaption><strong>Figure 2:</strong> Bar chart with the change in run time between the baseline and slowed-down version, for the top 5 methods of the Havlak benchmark. 
The red dashed line indicates the expected change for the <code>Vector.hasSome</code> method. Only async-profiler and JFR come close to the expectation.</figcaption>
</figure>
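<p>For the single-method case shown in Figure 2, the expected shift can be worked out with a little arithmetic. The helper below is my own back-of-the-envelope version of that expectation, not a formula from the paper:</p>

```python
def expected_share(share, factor):
    # If a method accounts for `share` percent of the run time and only
    # that method is slowed down by `factor`, the rest of the program is
    # unchanged, so the method's new share of the total is:
    #   share * factor / ((100 - share) + share * factor), in percent.
    return 100.0 * share * factor / ((100.0 - share) + share * factor)

# A method at 20% of run time, slowed down 3x, should grow to ~42.9%.
print(round(expected_share(20.0, 3.0), 1))  # → 42.9
```

<p>Comparing such an expectation against what a profiler actually reports after the slowdown is the essence of the litmus test described above.</p>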

<p>With this slowdown-based approach, we finally have a way to see how accurate sampling profilers are by approximating the <em>ground truth</em> profile. Since we cannot measure the ground truth directly, this sidesteps a fundamental problem and gives us a reasonably practical solution.</p>

<p>The <a href="#paper">paper</a> details how we implement our <em>divining</em> approach, i.e., how we slow down programs accurately.
It also has all the methodological details, research questions, benchmarking setup, and lots more numbers, especially in the appendix. So, please give it a read, and let us know what you think.</p>

<p>If you happen to attend the SPLASH conference,
Humphrey is presenting our work <a href="https://conf.researchr.org/details/icfp-splash-2025/vmil-2025/3/Evaluating-Candidate-Instructions-for-Reliable-Program-Slowdown-at-the-Compiler-Level">today</a> and on <a href="https://2025.splashcon.org/details/OOPSLA/207/Divining-Profiler-Accuracy-An-Approach-to-Approximate-Profiler-Accuracy-Through-Mach">Saturday</a>.</p>

<p>Questions, pointers, and suggestions are always welcome, for instance, on
<a href="https://mastodon.acm.org/@smarr/115375396310482176">Mastodon</a>,
<a href="https://bsky.app/profile/stefan-marr.de/post/3m36z2u34tk2l">BlueSky</a>, or
<a href="https://x.com/smarr/status/1978259584432091530">Twitter</a>.</p>

<p>Thanks to <a href="https://octavelarose.github.io/">Octave</a> for feedback on this post.</p>

<p>Update: The <a href="https://youtu.be/U-7PEopwtKA">recording of the talk</a> is now on YouTube.</p>

<iframe width="560" height="315" src="https://www.youtube.com/embed/U-7PEopwtKA?si=Qd3OgW3B2IDmJ4cX" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen=""></iframe>

<p><a id="paper"></a></p>

<p><strong>Abstract</strong></p>

<blockquote>
  <p>Optimizing performance on top of modern runtime systems with just-in-time (JIT) compilation is a challenge for a wide range of applications from browser-based applications on mobile devices to large-scale server applications. Developers often rely on sampling-based profilers to understand where their code spends its time. Unfortunately, sampling of JIT-compiled programs can give inaccurate and sometimes unreliable results.</p>

<p>To assess accuracy of such profilers, we would ideally want to compare their results to a known ground truth. With the complexity of today’s software and hardware stacks, such ground truth is unfortunately not available. Instead, we propose a novel technique to approximate a ground truth by accurately slowing down a Java program at the machine-code level, preserving its optimization and compilation decisions as well as its execution behavior on modern CPUs.</p>

<p>Our experiments demonstrate that we can slow down benchmarks by a specific amount, which is a challenge because of the optimizations in modern CPUs, and we verified with hardware profiling that on a basic-block level, the slowdown is accurate for blocks that dominate the execution. With the benchmarks slowed down to specific speeds, we confirmed that async-profiler, JFR, JProfiler, and YourKit maintain original performance behavior and assign the same percentage of run time to methods. Additionally, we identify cases of inaccuracy caused by missing debug information, which prevents the correct identification of the relevant source code. Finally, we tested the accuracy of sampling profilers by approximating the ground truth by the slowing down of specific basic blocks and found large differences in accuracy between the profilers.</p>

<p>We believe, our slowdown-based approach is the first practical methodology to assess the accuracy of sampling profilers for JIT-compiling systems and will enable further work to improve the accuracy of profilers.</p>

</blockquote>

<ul>
  <li>Divining Profiler Accuracy: An Approach to Approximate Profiler Accuracy Through Machine Code-Level Slowdown<br />
    
      
      H. Burchell,
      <em>
      S. Marr</em>;

    
        Proceedings of the ACM on Programming Languages,
      

    OOPSLA'25,
    

    ACM,
    2025.

    </li>

    <li>
      Paper:
        <a href="https://stefan-marr.de/downloads/oopsla25-burchell-marr-divining-profiler-accuracy.pdf">
          PDF</a>
    </li>

    <li>
        DOI: <a href="https://doi.org/10.1145/3763180">10.1145/3763180</a>
    </li>

    
    <li>
      Appendix: <a href="https://doi.org/10.5281/zenodo.16911348">online appendix</a>
    </li>
    


    <li>
      BibTex:
      <span tabindex="0" class="bibtex"><span class="biblink">bibtex</span>
      <pre>@article{Burchell:2025:Divining,
  abstract = {Optimizing performance on top of modern runtime systems with just-in-time (JIT) compilation is a challenge for a wide range of applications from browser-based applications on mobile devices to large-scale server applications. Developers often rely on sampling-based profilers to understand where their code spends its time. Unfortunately, sampling of JIT-compiled programs can give inaccurate and sometimes unreliable results.
  
  To assess accuracy of such profilers, we would ideally want to compare their results to a known ground truth. With the complexity of today's software and hardware stacks, such ground truth is unfortunately not available. Instead, we propose a novel technique to approximate a ground truth by accurately slowing down a Java program at the machine-code level, preserving its optimization and compilation decisions as well as its execution behavior on modern CPUs.
  
  Our experiments demonstrate that we can slow down benchmarks by a specific amount, which is a challenge because of the optimizations in modern CPUs, and we verified with hardware profiling that on a basic-block level, the slowdown is accurate for blocks that dominate the execution. With the benchmarks slowed down to specific speeds, we confirmed that async-profiler, JFR, JProfiler, and YourKit maintain original performance behavior and assign the same percentage of run time to methods. Additionally, we identify cases of inaccuracy caused by missing debug information, which prevents the correct identification of the relevant source code. Finally, we tested the accuracy of sampling profilers by approximating the ground truth by the slowing down of specific basic blocks and found large differences in accuracy between the profilers.
  
  We believe, our slowdown-based approach is the first practical methodology to assess the accuracy of sampling profilers for JIT-compiling systems and will enable further work to improve the accuracy of profilers.},
  acceptancerate = {0.356},
  appendix = {https://doi.org/10.5281/zenodo.16911348},
  articleno = {402},
  author = {Burchell, Humphrey and Marr, Stefan},
  blog = {https://stefan-marr.de/2025/10/can-we-know-whether-a-profiler-is-accurate/},
  doi = {10.1145/3763180},
  issn = {2475-1421},
  journal = {Proceedings of the ACM on Programming Languages},
  keywords = {Accuracy GroundTruth Java MeMyPublication Profiling Sampling myown},
  month = oct,
  number = {OOPSLAB25},
  numpages = {32},
  pdf = {https://stefan-marr.de/downloads/oopsla25-burchell-marr-divining-profiler-accuracy.pdf},
  publisher = {{ACM}},
  series = {OOPSLA'25},
  title = {{Divining Profiler Accuracy: An Approach to Approximate Profiler Accuracy Through Machine Code-Level Slowdown}},
  year = {2025},
  month_numeric = {10}
}
</pre>
      </span>
    </li>
</ul>]]></content><author><name></name></author><category term="Research" /><category term="Java" /><category term="Benchmarking" /><category term="Research" /><category term="Profilers" /><category term="Sampling" /><category term="Instrumentation" /><category term="Tooling" /><category term="paper" /><summary type="html"><![CDATA[If you have been following the adventures of our hero over the last couple of years, you might remember that we can’t really trust sampling profilers for Java, and it’s even worse for Java’s instrumentation-based profilers.]]></summary></entry><entry><title type="html">First Day: A New Chapter at the JKU</title><link href="https://stefan-marr.de/2025/10/first-day-at-jku/" rel="alternate" type="text/html" title="First Day: A New Chapter at the JKU" /><published>2025-10-01T08:02:19+02:00</published><updated>2025-10-01T08:02:19+02:00</updated><id>https://stefan-marr.de/2025/10/first-day-at-jku</id><content type="html" xml:base="https://stefan-marr.de/2025/10/first-day-at-jku/"><![CDATA[<p>It’s Wednesday. Is this important? It’s my first day in a new position. So, perhaps the real question is: what’s going to be important to me from now on?</p>

<p>Let’s get the titles out of the way first:
Today is my first day as <em>Universitäts­professor</em>. That’s a <em>full professor</em>, <em>chair</em>, <em>W3 Professor</em>, <em>gewoon hoogleraar</em>, or similar. Yeah, there are lots of <a href="https://en.wikipedia.org/wiki/List_of_academic_ranks">different names in different countries</a>. It’s also my first day as the head of the <a href="https://ssw.jku.at/">Institute for System Software</a>.
The term <em>institute</em> is used here for something that’s a research group in many other places.
This means I have the opportunity to work with a number of very smart people to offer university courses in the field of programming languages, compilers, and more broadly <em>system software</em>.
It also means I am asked to advise, mentor, and support others in their research journey, from taking their very first steps up to becoming independent academics and professors in their own right.
To me, this sounds fun. I am asked to help people learn, pursue knowledge, and develop their skills. That is something I not only enjoy but also find important for preparing the next generation to tackle the problems of our time.
However, this also means I reached the end of a journey.
That’s it. I am a full professor now, and I have convinced enough people that I am not entirely terrible at this job. Or so we all hope…</p>

<p>At this point, I already have to thank all the people at the JKU for the very warm welcome I received over the last few weeks. Particularly, thank you Peter, Herbert, Markus, and Karin, for all the support to get me started here! Similarly, I wouldn’t be here without my dear colleagues and mentors at <a href="https://stefan-marr.de/2025/07/last-day-at-kent/">Kent</a> and in the wider programming language research community. You know who you are, I hope.</p>

<h2 id="what-now">What Now?</h2>

<p>With the new job and responsibilities, I need to think about what’s now important to me.
What follows isn’t a detailed plan.
I had already been asked to formulate one of those, and I’ll continue to work on realizing it.
Instead, I want to think a bit more broadly here.</p>

<h3 id="teaching-advocate-for-fundamentals">Teaching: Advocate for Fundamentals</h3>

<p>Let’s start with teaching, since my first lectures will already be next week.</p>

<p>Our institute teaches various courses, including software development, compiler construction, advanced compiler construction, system software, dynamic compilation and run-time optimization, and principles of programming languages.</p>

<p>My impression from early discussions with colleagues is that I will need to work on making sure that we can keep teaching these fundamental topics in the future. While there seems to be a very strong push for <em>AI everything</em>, I have yet to be convinced that this makes the fundamentals any less important. On the contrary, it feels like we need to keep reminding people of <em>classic</em> techniques that are guaranteed to work, and that are correct and efficient. So, when it comes to teaching, I think an important part of my job will be advocating for the fundamentals.</p>

<p>Of course, looking at the material I’ll teach this term on <a href="https://ssw.jku.at/Teaching/Lectures/CB/VL/">compiler construction</a> and <a href="https://ssw.jku.at/Teaching/Lectures/SSW/">system software</a>, perhaps I can adapt it in future years. Currently, 6 out of 13 compiler construction lectures are on parsing. This makes me want to work out what the most useful learning outcomes for such a course should be today.</p>

<h3 id="research-take-risks-and-pursue-problems-too-hard-for-industry">Research: Take Risks and Pursue Problems Too Hard for Industry</h3>

<p>Some people seem to advocate for exploring new things and expanding one’s horizon when reaching this career level.
Indeed, I have the chance to take risks and explore new research topics, communities, and ways of working.</p>

<p>If there’s a single tag line for the work I have in mind, it might be: improve language implementations to better enable old and new kinds of applications. 
After all, I like to explore ideas that enable developers to make better use of computing systems.</p>

<p>This will take new ways of looking at problems.
For instance, with few exceptions, I have been shying away from very formal work in the past.
Though, a while ago I started dreaming of defining a new kind of high-level memory model, for which we may need a more formal approach in addition to building working prototypes. 
Looking at today’s memory models, they seem too low-level for dynamic languages such as Python and Ruby. I already gave a few talks
about the background of this work and will also give one at <a href="https://conf.researchr.org/details/icfp-splash-2025/sponsor-invited-talks-2025/4/Python-Is-It-Being-Killed-by-Incremental-Improvements-">SPLASH</a>.
This will be a huge project, and a risky one. Not least because it’s unclear whether the language communities will care enough about the issue before they start suffering more noticeably from not having a memory model.</p>

<p>And then there is interpreter performance, a topic I have been working on for a long time already.
Since I am now in a group with a long history in the area of compilers,
I would like to double down on generating fast interpreters.
Interpreters, the way we build them today, have a lot of headroom in terms of performance:
the classic ones implemented in C/C++, and even more so the ones on top of meta-compilation systems.
The work of <a href="https://arxiv.org/abs/2411.11469">Haoran Xu</a> suggests that we can do much better.
Unfortunately, it’s a really hard problem, for various reasons.
Something that doesn’t fit into the short and mid-term priorities of most companies.
But we can chip away at it slowly and steadily, benefiting lots of programming languages in the process.</p>

<p>I’ll also continue to work with my colleagues at Oracle on compiler topics
and with colleagues from <a href="https://stefan-marr.de/2025/07/last-day-at-kent/">PLAS</a>. We’ll keep doing fun stuff, some of which we’ll present at SPLASH in two weeks, including work on making <a href="https://stefan-marr.de/2025/08/how-to-slow-down-a-program/">programs slower (yes, slower!)</a> and <a href="https://stefan-marr.de/downloads/oopsla25-burchell-marr-divining-profiler-accuracy.pdf">approximating the ground truth profile for sampling profilers</a>.</p>

<p>I’ll stop here for now. Seems like I do need to get on with the actual job…
somewhere in <a href="https://www.jku.at/campus/der-jku-campus/gebaeude/science-park-3/">Science Park 3</a>.
I am looking forward to starting to work with all my new colleagues at the JKU and seeing which new collaborations and cooperations we can begin.
If you’re a student and interested in a project, please see the <a href="https://ssw.jku.at/Teaching/Projects/open.html">Open Projects page</a>, where I will post more concrete project ideas in the future.</p>

<p>I suppose I’ll also occasionally still be on 
<a href="https://mastodon.acm.org/@smarr/115297555824308876">Mastodon</a>,
<a href="https://bsky.app/profile/stefan-marr.de/post/3m24gusnpnk2l">BlueSky</a>, and
<a href="https://x.com/smarr/status/1973277801689260347">Twitter</a>.</p>]]></content><author><name></name></author><category term="Personal" /><category term="Personal" /><category term="Linz" /><summary type="html"><![CDATA[It’s Wednesday. Is this important? It’s my first day in a new position. So, perhaps the real question is: what’s going to be important to me from now on?]]></summary></entry><entry><title type="html">How to Slow Down a Program? And Why it Can Be Useful.</title><link href="https://stefan-marr.de/2025/08/how-to-slow-down-a-program/" rel="alternate" type="text/html" title="How to Slow Down a Program? And Why it Can Be Useful." /><published>2025-08-27T12:20:04+02:00</published><updated>2025-08-27T12:20:04+02:00</updated><id>https://stefan-marr.de/2025/08/how-to-slow-down-a-program</id><content type="html" xml:base="https://stefan-marr.de/2025/08/how-to-slow-down-a-program/"><![CDATA[<p>Most research on programming language performance asks a variation of a single question: how can we make some specific program faster?
Sometimes we may even investigate how we can use less memory.
This means a lot of research focuses solely on reducing the amount of resources needed to achieve some computational goal.</p>

<p>So, why on earth might we be interested in slowing down programs then?</p>

<h2 id="slowing-down-programs-is-surprisingly-useful">Slowing Down Programs is Surprisingly Useful!</h2>

<p>Making programs slower can be useful to find race conditions,
to simulate speedups, and to assess how accurate profilers are.</p>

<p>To detect race conditions,
we may want to use an approach similar to fuzzing.
Instead of exploring a program’s implementation
by varying its input,
we can explore different instruction interleavings and thread or event schedules
by slowing down program parts to change timings.
This approach allows us to identify concurrency bugs
and is used by <a href="https://www.usenix.org/legacy/event/osdi08/tech/full_papers/musuvathi/musuvathi.pdf">CHESS</a>, <a href="https://people.cs.uchicago.edu/~shanlu/paper/eurosys23.pdf">WAFFLE</a>, and <a href="https://drops.dagstuhl.de/storage/00lipics/lipics-vol333-ecoop2025/LIPIcs.ECOOP.2025.9/LIPIcs.ECOOP.2025.9.pdf">NACD</a>.</p>
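To make the idea concrete, here is a minimal, self-contained sketch (all names are made up, and this manually forces one interleaving rather than searching for it as the cited tools do): delaying one thread between its read and its write of a shared counter deterministically exposes a lost-update race.

```java
import java.util.concurrent.CountDownLatch;

// Illustrative only: we "slow down" each thread between its read and its
// write of an unsynchronized counter, forcing the racy interleaving that a
// slowdown-based race detector would try to provoke automatically.
public class LostUpdateDemo {
    static int counter;  // shared, deliberately unsynchronized

    public static int runLostUpdate() {
        counter = 0;
        // Both threads must finish their read before either may write.
        CountDownLatch bothHaveRead = new CountDownLatch(2);
        Runnable increment = () -> {
            int read = counter;            // 1. read the shared value
            bothHaveRead.countDown();      // 2. the injected "slowdown":
            try {                          //    wait for the other reader
                bothHaveRead.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            counter = read + 1;            // 3. write back a stale value
        };
        Thread a = new Thread(increment);
        Thread b = new Thread(increment);
        a.start(); b.start();
        try {
            a.join(); b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return counter;  // 1, not 2: one increment was lost
    }

    public static void main(String[] args) {
        System.out.println("counter after two increments: " + runLostUpdate());
    }
}
```

Without the latch, the lost update only happens when the scheduler interleaves the threads unluckily, which is exactly why perturbing timings helps surface such bugs.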

<p>The <a href="https://github.com/plasma-umass/coz">Coz profiler</a> is an example of how slowing down programs can be used to simulate speedup.
With Coz, we can estimate whether an optimization is beneficial
before implementing it.
Coz simulates it by slowing down <em>all other</em> program parts.
The part we think might be optimizable stays at the same speed
it was before, but is now <em>virtually sped up</em>, which allows us to see
whether it gives enough of a benefit to justify a perhaps lengthy optimization project.</p>
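The arithmetic behind this can be sketched in a few lines (a simplification; Coz itself applies sampled, per-thread delays rather than these closed-form estimates):

```java
// Simplified arithmetic behind "virtual speedup", illustrative only.
public class VirtualSpeedup {
    // To make a region look `speedup` faster (e.g. 0.25 = 25%), all *other*
    // program parts are delayed by speedup * (time the region took); the
    // region then shrinks relative to the rest by exactly that factor.
    public static double delayForOthers(double regionTime, double speedup) {
        return regionTime * speedup;
    }

    // Predicted total runtime if the region really were `speedup` faster:
    // the original total minus the time the optimization would save.
    public static double predictedTotal(double totalTime, double regionTime,
                                        double speedup) {
        return totalTime - regionTime * speedup;
    }

    public static void main(String[] args) {
        // A 10 s program whose region of interest takes 4 s:
        // a 25% virtual speedup predicts 10 - 4 * 0.25 = 9 s.
        System.out.println(predictedTotal(10.0, 4.0, 0.25));
    }
}
```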

<p>And, as mentioned before, we can also use it to assess how accurate profilers are.
Though, I’ll leave this for the next blog post. :)</p>

<p>The current approaches to slowing down programs for these use cases are
rather coarse-grained, though. Race detection often adapts the scheduler or uses APIs such as <code class="language-plaintext highlighter-rouge">Thread.sleep()</code>.
Similarly, Coz pauses the execution of the other threads.
<a href="https://plv.colorado.edu/papers/mytkowicz-pldi10.pdf">Work</a> on measuring whether profilers give actionable results
inserts bytecodes into Java programs to compute Fibonacci numbers.</p>

<p>By using more fine-grained slowdowns,
we think we could make race detection, speedup estimation, and profiler accuracy assessments more precise. Thus, we looked into inserting slowdown instructions into basic blocks.</p>

<h2 id="which-x86-instructions-allow-us-to-consistently-slow-down-basic-blocks">Which x86 Instructions Allow us to Consistently Slow Down Basic Blocks?</h2>

<p>Let’s assume we run on some x86 processor, and we are looking at programs
from the processor’s perspective.</p>

<p>When running a benchmark like <a href="https://github.com/smarr/are-we-fast-yet/blob/master/benchmarks/Java/src/Towers.java#L74">Towers</a>,
the OpenJDK’s HotSpot JVM may compile it to x86 instructions like this:</p>

<figure class="highlight"><pre><code class="language-nasm" data-lang="nasm"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
</pre></td><td class="code"><pre><span class="nf">mov</span> <span class="kt">dword</span> <span class="nv">ptr</span> <span class="p">[</span><span class="nb">rsp</span><span class="o">+</span><span class="mh">0x18</span><span class="p">],</span> <span class="nb">r8d</span>
<span class="nf">mov</span> <span class="kt">dword</span> <span class="nv">ptr</span> <span class="p">[</span><span class="nb">rsp</span><span class="p">],</span> <span class="nb">ecx</span>
<span class="nf">mov</span> <span class="kt">qword</span> <span class="nv">ptr</span> <span class="p">[</span><span class="nb">rsp</span><span class="o">+</span><span class="mh">0x20</span><span class="p">],</span> <span class="nb">rsi</span>
<span class="nf">mov</span> <span class="nb">ebx</span><span class="p">,</span> <span class="kt">dword</span> <span class="nv">ptr</span> <span class="p">[</span><span class="nb">rsi</span><span class="o">+</span><span class="mh">0x10</span><span class="p">]</span>
<span class="nf">mov</span> <span class="nb">r9d</span><span class="p">,</span> <span class="nb">edx</span>
<span class="nf">cmp</span> <span class="nb">edx</span><span class="p">,</span> <span class="mh">0x1</span>
<span class="nf">jnz</span> <span class="mi">0</span><span class="nv">x...</span> <span class="o">&lt;</span><span class="nb">Bl</span><span class="nv">ock</span> <span class="mi">55</span><span class="o">&gt;</span>	
</pre></td></tr></tbody></table></code></pre></figure>

<p>This is one of the basic blocks produced by HotSpot’s C2 compiler.
For our purposes, it suffices to see that there are some memory accesses
with the <code class="language-plaintext highlighter-rouge">mov</code> instructions, and we end up checking whether the <code class="language-plaintext highlighter-rouge">edx</code> register
contains the value 1. If that’s not the case, we jump to Block 55.
Otherwise, execution continues in the next basic block.
A key property of a basic block is that there’s no control flow inside of it,
which means once it starts executing, all of its instructions will execute.</p>

<p>Though, how can we slow it down?</p>

<p>x86 has many, many different instructions one could try to insert into the block,
each of which will probably consume CPU cycles.
However, modern CPUs try to execute as many instructions
as possible at the same time using out-of-order execution.
This means that instructions in our basic block
that do not directly depend on each other
might be executed at the same time.
For instance, the first three <code class="language-plaintext highlighter-rouge">mov</code> instructions access neither the same register
nor memory location. This means the order in which they are executed here does not matter.
Though, which optimizations CPUs apply depends on the program and the specific CPU generation,
or rather microarchitecture.</p>

<p>To find suitable instructions to slow down basic blocks,
we experimented only on an Intel Core i5-10600 CPU,
which has the <a href="https://en.wikipedia.org/wiki/Comet_Lake">Comet Lake-S microarchitecture</a>.
On other microarchitectures, things can be very different.</p>

<p>For the slowdown that we want, 
we can use <code class="language-plaintext highlighter-rouge">nop</code> or <code class="language-plaintext highlighter-rouge">mov regX, regX</code> instructions on Comet Lake-S.
This <code class="language-plaintext highlighter-rouge">mov</code> would move the value from register <code class="language-plaintext highlighter-rouge">X</code> to itself, so it basically does nothing.
These two instructions give us a slowdown
that is small enough to slow down most blocks accurately to a desired target speed,
and the slowdown seems to affect only the specific block it is meant for.</p>

<p>Our basic block from earlier would then perhaps end up with <code class="language-plaintext highlighter-rouge">nop</code> instructions
interleaved after each instruction.
In practice, the number of instructions we need to insert
depends on how much time a basic block takes in the program.
Though, for illustration, it might look like this:</p>

<figure class="highlight"><pre><code class="language-nasm" data-lang="nasm"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
</pre></td><td class="code"><pre><span class="nf">mov</span> <span class="kt">dword</span> <span class="nv">ptr</span> <span class="p">[</span><span class="nb">rsp</span><span class="o">+</span><span class="mh">0x18</span><span class="p">],</span> <span class="nb">r8d</span>
<span class="nf">nop</span>
<span class="nf">mov</span> <span class="kt">dword</span> <span class="nv">ptr</span> <span class="p">[</span><span class="nb">rsp</span><span class="p">],</span> <span class="nb">ecx</span>
<span class="nf">nop</span>
<span class="nf">mov</span> <span class="kt">qword</span> <span class="nv">ptr</span> <span class="p">[</span><span class="nb">rsp</span><span class="o">+</span><span class="mh">0x20</span><span class="p">],</span> <span class="nb">rsi</span>
<span class="nf">nop</span>
<span class="nf">mov</span> <span class="nb">ebx</span><span class="p">,</span> <span class="kt">dword</span> <span class="nv">ptr</span> <span class="p">[</span><span class="nb">rsi</span><span class="o">+</span><span class="mh">0x10</span><span class="p">]</span>
<span class="nf">nop</span>
<span class="nf">mov</span> <span class="nb">r9d</span><span class="p">,</span> <span class="nb">edx</span>
<span class="nf">nop</span>
<span class="nf">cmp</span> <span class="nb">edx</span><span class="p">,</span> <span class="mh">0x1</span>
<span class="nf">nop</span>
<span class="nf">jnz</span> <span class="mi">0</span><span class="nv">x...</span> <span class="o">&lt;</span><span class="nb">Bl</span><span class="nv">ock</span> <span class="mi">55</span><span class="o">&gt;</span>	
</pre></td></tr></tbody></table></code></pre></figure>
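As a back-of-the-envelope illustration of how the amount of padding could be chosen, one might estimate it from the block's measured cost and the filler's per-instruction cost. The cycle numbers below are invented for illustration and this is not the paper's actual model; on an out-of-order core the effective cost of a single <code class="language-plaintext highlighter-rouge">nop</code> is well below one cycle, so any such estimate needs calibration against measurements.

```java
// First-order estimate of how many filler instructions (e.g. nops) to insert
// into a basic block so that it takes roughly (1 + targetOverhead) times as
// long. blockCycles and fillerCycles are assumed to be measured averages.
public class SlowdownPadding {
    public static long paddingCount(double blockCycles, double fillerCycles,
                                    double targetOverhead) {
        // Extra cycles wanted, divided by the cost of one filler instruction.
        return Math.round(blockCycles * targetOverhead / fillerCycles);
    }

    public static void main(String[] args) {
        // A block costing ~12 cycles, fillers costing ~0.5 cycles each,
        // 100% overhead (2x slowdown): about 24 filler instructions.
        System.out.println(paddingCount(12.0, 0.5, 1.0));
    }
}
```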

<p>We tried six different candidates, including a <code class="language-plaintext highlighter-rouge">push</code>-<code class="language-plaintext highlighter-rouge">pop</code> sequence, to get a better impression
of how Comet Lake-S deals with them.
For more details of how and what we tried, please have a look at our <a href="#paper">short paper below</a>, which we will present at the <a href="https://conf.researchr.org/home/icfp-splash-2025/vmil-2025#event-overview">VMIL workshop</a>.</p>

<p>When inserting these instructions into basic blocks,
so that each individual basic block
takes about twice as much time as before,
we end up with a program that indeed is overall twice as slow, as one would hope.
Even better, when we look at the Towers benchmark with the <a href="https://github.com/async-profiler/async-profiler">async-profiler</a> for HotSpot, and compare the proportions of run time it
attributes to each method, the slowed-down and the normal version
match almost perfectly, as <a href="#fig1">illustrated below</a>.
The same is not true for the other candidates we looked at.</p>

<figure id="fig1">
<img src="/assets/2025/08/AsyncSlowdownVsNoSlowdownGrid.svg" height="300" />
<figcaption><strong>Figure 1:</strong> A scatter plot per slowdown instruction with the median run-time percentage for the top six Java methods of Towers. The <em>X=Y</em> diagonal indicates that a method’s run‐time percentage remains the same with and without slowdown.</figcaption>
</figure>

<p>The paper has a few more details, including a more detailed analysis
of the slowdown each candidate introduces,
how precise the slowdown is for all basic blocks
in the benchmark, and whether it makes a difference when we put the slowdown
all at the beginning, interleaved, or at the end.</p>

<p>Of course, this work is merely a stepping stone to more interesting things,
which I will look at in a bit more detail in the next post.</p>

<p>Until then, the paper is linked below, and questions, pointers, and suggestions are welcome on 
<a href="https://mastodon.acm.org/@smarr/115100270023431067">Mastodon</a>,
<a href="https://bsky.app/profile/stefan-marr.de/post/3lxetckoyps2h">BlueSky</a>, or
<a href="https://x.com/smarr/status/1960650722170552800">Twitter</a>.</p>

<p>Update: The <a href="https://youtu.be/xNwup0qn87g">recording of the talk</a> is now on YouTube.</p>

<p><a id="paper"></a></p>

<p><strong>Abstract</strong></p>

<blockquote>
  <p>Slowing down programs has surprisingly many use cases: it helps finding race conditions, enables speedup estimation, and allows us to assess a profiler’s accuracy. Yet, slowing down a program is complicated because today’s CPUs and runtime systems can optimize execution on the fly, making it challenging to preserve a program’s performance behavior to avoid introducing bias.</p>

<p>We evaluate six x86 instruction candidates for controlled and fine-grained slowdown including NOP, MOV, and PAUSE. We tested each candidate’s ability to achieve an overhead of 100%, to maintain the profiler-observable performance behavior, and whether slowdown placement within basic blocks influences results. On an Intel Core i5-10600, our experiments suggest that only NOP and MOV instructions are suitable. We believe these experiments can guide future research on advanced developer tooling that utilizes fine-granular slowdown at the machine-code level.</p>

</blockquote>

<ul>
  <li>Evaluating Candidate Instructions for Reliable Program Slowdown at the Compiler Level: Towards Supporting Fine-Grained Slowdown for Advanced Developer Tooling<br />
    H. Burchell, <em>S. Marr</em>;
    In Proceedings of the 17th ACM SIGPLAN International Workshop on Virtual Machines and Intermediate Languages,
    VMIL'25, p. 8, ACM, 2025.
  </li>

    <li>
      Paper:
        <a href="https://stefan-marr.de/downloads/vmil25-burchell-marr-evaluating-candidate-instructions-for-reliable-program-slowdown-at-the-compiler-level.pdf">
          PDF</a>
    </li>

    <li>
        DOI: <a href="https://doi.org/10.1145/3759548.3763374">10.1145/3759548.3763374</a>
    </li>

    


    <li>
      BibTex:
      <span tabindex="0" class="bibtex"><span class="biblink">bibtex</span>
      <pre>@inproceedings{Burchell:2025:SlowCandidates,
  abstract = {Slowing down programs has surprisingly many use cases: it helps finding race conditions, enables speedup estimation, and allows us to assess a profiler's accuracy. Yet, slowing down a program is complicated because today's CPUs and runtime systems can optimize execution on the fly, making it challenging to preserve a program's performance behavior to avoid introducing bias.
  
  We evaluate six x86 instruction candidates for controlled and fine-grained slowdown including NOP, MOV, and PAUSE. We tested each candidate’s ability to achieve an overhead of 100%, to maintain the profiler-observable performance behavior, and whether slowdown placement within basic blocks influences results. On an Intel Core i5-10600, our experiments suggest that only NOP and MOV instructions are suitable. We believe these experiments can guide future research on advanced developer tooling that utilizes fine-granular slowdown at the machine-code level.},
  author = {Burchell, Humphrey and Marr, Stefan},
  blog = {https://stefan-marr.de/2025/08/how-to-slow-down-a-program/},
  booktitle = {Proceedings of the 17th ACM SIGPLAN International Workshop on Virtual Machines and Intermediate Languages},
  doi = {10.1145/3759548.3763374},
  isbn = {979-8-4007-2164-9/2025/10},
  keywords = {Benchmarking HotSpot ISA Instructions Java MeMyPublication assembly evaluation myown slowdown x86},
  location = {Singapore},
  month = oct,
  pages = {8},
  pdf = {https://stefan-marr.de/downloads/vmil25-burchell-marr-evaluating-candidate-instructions-for-reliable-program-slowdown-at-the-compiler-level.pdf},
  publisher = {{ACM}},
  series = {VMIL'25},
  title = {{Evaluating Candidate Instructions for Reliable Program Slowdown at the Compiler Level: Towards Supporting Fine-Grained Slowdown for Advanced Developer Tooling}},
  year = {2025},
  month_numeric = {10}
}
</pre>
      </span>
    </li>
</ul>]]></content><author><name></name></author><category term="Research" /><category term="Java" /><category term="Benchmarking" /><category term="Research" /><category term="Profilers" /><category term="Sampling" /><category term="Instrumentation" /><category term="Tooling" /><category term="paper" /><summary type="html"><![CDATA[Most research on programming language performance asks a variation of a single question: how can we make some specific program faster? Sometimes we may even investigate how we can use less memory. This means a lot of research focuses solely on reducing the amount of resources needed to achieve some computational goal.]]></summary></entry><entry><title type="html">It’s Thursday, and My Last* Day at Kent</title><link href="https://stefan-marr.de/2025/07/last-day-at-kent/" rel="alternate" type="text/html" title="It’s Thursday, and My Last* Day at Kent" /><published>2025-07-31T10:54:08+02:00</published><updated>2025-07-31T10:54:08+02:00</updated><id>https://stefan-marr.de/2025/07/last-day-at-kent</id><content type="html" xml:base="https://stefan-marr.de/2025/07/last-day-at-kent/"><![CDATA[<p>Today is the 31st of July 2025, and from tomorrow on I’ll be “between jobs”, or as Gen Z allegedly calls it, on a micro-retirement.</p>

<p>When I first came to Kent for my interview, I was thinking, I’ll do this one for practice.
I still had more than 2 years left on a research grant we just got, which promised to be lots of fun, but academic jobs for PL systems people are rare, even rarer these days.
But then I got the call from Richard Jones, offering me the position, and I never regretted taking him up on it.</p>

<p>Kent’s School of Computing was just growing its <a href="https://research.kent.ac.uk/programming-languages-systems/">Programming Languages and Systems (PLAS) group</a> and Richard, Simon Thompson, Andy King, Peter Rodgers, and many others at the School did a remarkable job in creating an environment and community that was truly supportive of young academics taking their first steps in a permanent academic post. Be it about wrestling with teaching duties, papers, reviews, reviewers, and of course grant writing. PLAS and the School of Computing was the right place for me.</p>

<p>Of course, many things have changed since my start in October 2017. Perhaps most notably, Computing is now in the Kennedy building, a very nice space. But there was also that moment where we, the young ones, became the “senior” ones. Mark, Laura, and Dominic grew well into their new roles, and I can only hope that I passed on some of the extensive support I got to the people who started after me.</p>

<p>There are many challenges ahead for my dear colleagues at Kent, but I hope that enough of the spirit of support and community remains in the School, enabling PLAS and the next generation of academics to do great things.</p>

<p>Also a huge <em>thank you</em> to Kemi, Anna, and Janet for keeping the School afloat.</p>

<p>I’ll miss you all. Thanks for everything! And see you soon!</p>

<figure class="full"><img src="/assets/2025/07/PLAS-2023-10.jpg" />
<figcaption>Most of PLAS in October 2023</figcaption></figure>

<p><strong>*</strong> It’s a little more complicated than that, but for good reasons. Right, EPSRC? :)</p>]]></content><author><name></name></author><category term="Personal" /><category term="Personal" /><category term="Canterbury" /><summary type="html"><![CDATA[Today is the 31st of July 2025, and from tomorrow on I’ll be “between jobs”, or as Gen Z allegedly calls it, on a micro-retirement.]]></summary></entry><entry><title type="html">Instrumentation-based Profiling on JVMs is Broken!</title><link href="https://stefan-marr.de/2024/09/instrumenation-based-profiling-on-jvms-is-broken/" rel="alternate" type="text/html" title="Instrumentation-based Profiling on JVMs is Broken!" /><published>2024-09-17T11:08:10+02:00</published><updated>2024-09-17T11:08:10+02:00</updated><id>https://stefan-marr.de/2024/09/instrumenation-based-profiling-on-jvms-is-broken</id><content type="html" xml:base="https://stefan-marr.de/2024/09/instrumenation-based-profiling-on-jvms-is-broken/"><![CDATA[<p>Last year, we <a href="https://stefan-marr.de/2023/09/dont-blindly-trust-your-profiler/">looked at how well sampling profilers work on top of the JVM</a>.
Unfortunately, they suffer from issues such as safepoint bias and may not correctly attribute observed run time to the correct methods because of the complexities introduced by inlining and other compiler optimizations.</p>

<p>After looking at sampling profilers, <a href="https://github.com/HumphreyHCB">Humphrey</a> started to investigate instrumentation-based profilers and found during his initial investigation that they were giving much more consistent numbers. Unfortunately, it became quickly clear that the state-of-the-art instrumentation-based profilers on the JVM also have major issues, which results in profiles that are not representative of production performance. Since profilers are supposed to help us identify performance issues, they fail at their one job.</p>

<p>When investigating them further, we found that they interact badly with inlining and other standard optimizations. Because the profilers we found instrument JVM bytecodes, they add a lot of extra code that compiler optimizations treat like any other application code.
While this does not strictly prevent optimizations such as inlining, the extra code interferes enough with the optimizations that the observable behavior of a program with and without inlining is basically identical.
In practice, this means that instrumentation-based profilers on the JVM are easily portable, but they can’t effectively guide developers to the code that would benefit most from attention, which is their main purpose.</p>

<h3 id="profilers-that-do-not-capture-production-performance-will-misguide-us">Profilers that do not capture production performance will misguide us!</h3>

<p>While they can still identify the code that is activated most often, the interaction with optimizations means that developers see mostly unoptimized behavior.
With today’s highly optimizing compilers this is unfortunate, because we may end up optimizing code that the compiler normally would have optimized for us already, and we spend time on things that likely won’t make a difference in production.</p>

<p>Let’s look at an example from our paper:</p>

<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="kd">class</span> <span class="nc">ActionA</span> <span class="o">{</span> <span class="kt">int</span> <span class="n">id</span><span class="o">;</span> <span class="kt">void</span> <span class="nf">execute</span><span class="o">()</span> <span class="o">{}</span> <span class="o">}</span>
<span class="kd">class</span> <span class="nc">ActionB</span> <span class="o">{</span> <span class="kt">int</span> <span class="n">id</span><span class="o">;</span> <span class="kt">void</span> <span class="nf">execute</span><span class="o">()</span> <span class="o">{}</span> <span class="o">}</span>
<span class="kt">var</span> <span class="n">actions</span> <span class="o">=</span> <span class="n">getMixOfManyActions</span><span class="o">();</span>
<span class="n">bubbleSortById</span><span class="o">(</span><span class="n">actions</span><span class="o">);</span>
<span class="n">framework</span><span class="o">.</span><span class="na">execute</span><span class="o">(</span><span class="n">actions</span><span class="o">);</span></code></pre></figure>

<p>In this arguably somewhat contrived example, we use some kind of framework,
for which we have actions that the framework applies for us.
This is probably a worst case for profilers that instrument bytecodes.
Here, the <code class="language-plaintext highlighter-rouge">execute()</code> methods would be identified as the most problematic
aspect. Though, they don’t do anything.
A just-in-time compiler like HotSpot’s C2,
would likely end up seeing a bimorphic call site to <code class="language-plaintext highlighter-rouge">execute()</code> and inline both methods.
And if the compiler heuristics are with us, it might even optimize out the empty loop
in the framework.</p>

<p>So, if we assume a sufficiently smart compiler, our inefficient code here,
forced on us by a framework, is taken care of by the compiler.
And a good profiler would ideally guide us to <code class="language-plaintext highlighter-rouge">bubbleSortById(.)</code> as being of interest.
Typically, we’d expect to get a good speedup here by switching to a more suitable
sort, especially since we implicitly assume there are many actions, so that this code matters
in production.

<p>To me this means that instrumentation-based profilers can only be a last resort for when sampling, with all its own flaws, fails. They are just not useful enough as they are.</p>

<h3 id="can-we-do-better-than-profilers-that-instrument-bytecode">Can we do better than profilers that instrument bytecode?</h3>

<p>At the time, Humphrey was quite in favor of instrumentation,
because it gives very consistent results.
So, he wanted to make the results of instrumentation-based profilers more realistic.
Inspired by the work of <a href="https://dl.acm.org/doi/10.1145/3591473">Basso et al.</a>,
he built an instrumentation-based profiler into the Graal just-in-time compiler
that works more like classic instrumentation-based profilers for ahead-of-time-compiled language implementations.</p>

<p>The basic idea is <a href="#fig1">illustrated below</a>:</p>

<figure id="fig1">
<img src="/assets/2024/09/inst-based/Compiler-Phase-Instrumentation.svg" width="440" />
<figcaption><strong>Figure 1:</strong> Instrumentation-based profilers on the JVM typically insert instrumentation very early, before compilers optimize code.
In our profiler, instrumentation is inserted very late, to minimize interfering with optimizations.
</figcaption>
</figure>

<p>Instead of inserting the instrumentation right when the bytecode is loaded,
for instance with an <a href="https://docs.oracle.com/javase/8/docs/api/java/lang/instrument/package-summary.html">agent</a> or some other form of bytecode rewriting,
we move the addition of instrumentation code to a much later part of the just-in-time compilation. Most importantly, we insert it only after inlining and most optimizations are performed.
To keep the prototype simple, we insert the probes right before the high-level IR is turned into the lower-level IR.
At this point, a few optimizations are still to be performed, including instruction selection and register allocation. Though, in the grand scheme of things, these are minor.</p>
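<p>The effect of moving instrumentation can be illustrated with a deliberately simplified Python sketch, in which a “program” is just a list of operations and the “optimizer” merely drops no-ops, standing in for inlining and dead-code elimination; none of this reflects Graal’s actual IR or APIs:</p>

```python
# Toy model: a "program" is a list of operations; the "optimizer" drops
# no-ops, standing in for inlining and dead-code elimination.

def optimize(ops):
    return [op for op in ops if op != "noop"]

def instrument(ops):
    # Wrap every operation in a probe that would record execution counts.
    return [("probe", op) for op in ops]

program = ["work", "noop", "noop", "work"]

# Early instrumentation: probes are added before optimization. The wrapped
# no-ops no longer look like no-ops, so the optimizer keeps all four
# operations, and the profile reports work that optimized code never does.
early = optimize(instrument(program))

# Late instrumentation: optimize first, then add probes. Only the two
# operations that actually survive optimization get measured.
late = instrument(optimize(program))

print(len(early), len(late))  # 4 vs. 2 instrumented operations
```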

<h3 id="how-much-better-is-it">How much better is it?</h3>

<p>With his prototype, Humphrey managed not only to achieve much better performance than classic instrumentation-based profilers, but also to minimize interference with optimizations.
For a rough idea of the overall performance impact of this approach,
let’s have a look at <a href="#fig2">Figure 2</a>:</p>

<figure id="fig2">
<img src="/assets/2024/09/inst-based/Overhead-BoxPlot-Logarithmic.svg" width="440" />
<figcaption><strong>Figure 2:</strong> Sampling-based profilers such as Async, Honest, JFR,
Perf, and YourKit (in sampling mode) have very low overhead, though suffer from safepoint bias
and only observe samples.
YourKit and JProfiler, when doing instrumentation, introduce overhead of two orders of magnitude
and lead to unrealistic results because of their impact on optimizations.
Bubo, our prototype, has much lower overhead, and does not interfere with optimizations.
</figcaption>
</figure>

<p>With a few extra tricks, briefly sketched in the paper, we get good attribution of where time is spent even in the presence of inlining, reduce overhead, and benefit from the more precise results of instrumentation, which, unlike sampling, does not have the drawback of only occasionally obtaining data.</p>

<p>There’s one major open question though: what does a correct profile look like?
At the moment, we can’t assess whether our approach is correct.
Sampling profilers, as we saw <a href="https://stefan-marr.de/2023/09/dont-blindly-trust-your-profiler/">last year</a>, also do not agree on a single answer.
So, while we believe our approach is much better than classic instrumentation, we still need to find out how correct it is.</p>

<p>All results so far, and a few more technical details are in the paper linked below.
Questions, pointers, and suggestions are greatly appreciated
perhaps on 
<a href="https://mastodon.acm.org/@smarr/113152217131389091">Mastodon</a> or
Twitter <a href="https://x.com/smarr/status/1835976792668061978">@smarr</a>.</p>

<p><a id="paper"></a></p>

<p><strong>Abstract</strong></p>

<blockquote>
  <p>Profilers are crucial tools for identifying and improving application performance. However, for language implementations with just-in-time (JIT) compilation, e.g., for Java and JavaScript, instrumentation-based profilers can have significant overheads and report unrealistic results caused by the instrumentation.</p>

<p>In this paper, we examine state-of-the-art instrumentation-based profilers for Java to determine the realism of their results. We assess their overhead, the effect on compilation time, and the generated bytecode. We found that the profiler with the lowest overhead increased run time by 82x. Additionally, we investigate the realism of results by testing a profiler’s ability to detect whether inlining is enabled, which is an important compiler optimization. Our results document that instrumentation can alter program behavior so that performance observations are unrealistic, i.e., they do not reflect the performance of the uninstrumented program.</p>

<p>As a solution, we sketch late-compiler-phase-based instrumentation for just-in-time compilers, which gives us the precision of instrumentation-based profiling with an overhead that is multiple magnitudes lower than that of standard instrumentation-based profilers, with a median overhead of 23.3% (min. 1.4%, max. 464%). By inserting probes late in the compilation process, we avoid interfering with compiler optimizations, which yields more realistic results.</p>

</blockquote>

<ul>
  <li>Towards Realistic Results for Instrumentation-Based Profilers for JIT-Compiled Systems<br />
    
      
      H. Burchell,
      
      O. Larose,
      <em>
      S. Marr</em>;

    
        In Proceedings of the 21st ACM SIGPLAN International Conference on Managed Programming Languages and Runtimes,
      

    MPLR'24,
    

    ACM,
    2024.

    </li>

    <li>
      Paper:
        <a href="https://stefan-marr.de/downloads/mplr24-burchell-et-al-towards-realistic-results-for-instrumentation-based-profilers-for-jit-compiled-systems.pdf">
          PDF</a>
    </li>

    <li>
        DOI: <a href="https://doi.org/10.1145/3679007.3685058">10.1145/3679007.3685058</a>
    </li>

    


    <li>
      BibTex:
      <span tabindex="0" class="bibtex"><span class="biblink">bibtex</span>
      <pre>@inproceedings{Burchell:2024:InstBased,
  abstract = {Profilers are crucial tools for identifying and improving application performance. However, for language implementations with just-in-time (JIT) compilation, e.g., for Java and JavaScript, instrumentation-based profilers can have significant overheads and report unrealistic results caused by the instrumentation.
  
  In this paper, we examine state-of-the-art instrumentation-based profilers for Java to determine the realism of their results. We assess their overhead, the effect on compilation time, and the generated bytecode. We found that the profiler with the lowest overhead increased run time by 82x. Additionally, we investigate the realism of results by testing a profiler’s ability to detect whether inlining is enabled, which is an important compiler optimization. Our results document that instrumentation can alter program behavior so that performance observations are unrealistic, i.e., they do not reflect the performance of the uninstrumented program.
  
  As a solution, we sketch late-compiler-phase-based instrumentation for just-in-time compilers, which gives us the precision of instrumentation-based profiling with an overhead that is multiple magnitudes lower than that of standard instrumentation-based profilers, with a median overhead of 23.3% (min. 1.4%, max. 464%). By inserting probes late in the compilation process, we avoid interfering with compiler optimizations, which yields more realistic results.},
  author = {Burchell, Humphrey and Larose, Octave and Marr, Stefan},
  blog = {https://stefan-marr.de/2024/09/instrumenation-based-profiling-on-jvms-is-broken/},
  booktitle = {Proceedings of the 21st ACM SIGPLAN International Conference on Managed Programming Languages and Runtimes},
  doi = {10.1145/3679007.3685058},
  keywords = {Graal Instrumentation JVM Java MeMyPublication Optimization Profiler Profiling Sampling myown},
  month = sep,
  pdf = {https://stefan-marr.de/downloads/mplr24-burchell-et-al-towards-realistic-results-for-instrumentation-based-profilers-for-jit-compiled-systems.pdf},
  publisher = {ACM},
  series = {MPLR'24},
  title = {{Towards Realistic Results for Instrumentation-Based Profilers for JIT-Compiled Systems}},
  year = {2024},
  month_numeric = {9}
}
</pre>
      </span>
    </li>
</ul>]]></content><author><name></name></author><category term="Research" /><category term="Java" /><category term="Benchmarking" /><category term="Research" /><category term="Profilers" /><category term="Sampling" /><category term="Instrumentation" /><category term="Tooling" /><category term="paper" /><summary type="html"><![CDATA[Last year, we looked at how well sampling profilers work on top of the JVM. Unfortunately, they suffer from issues such as safepoint bias and may not correctly attribute observed run time to the correct methods because of the complexities introduced by inlining and other compiler optimizations.]]></summary></entry><entry><title type="html">5 Reasons Why Box Plots are the Better Default Choice for Visualizing Performance</title><link href="https://stefan-marr.de/2024/06/5-reasons-for-box-plots-as-default/" rel="alternate" type="text/html" title="5 Reasons Why Box Plots are the Better Default Choice for Visualizing Performance" /><published>2024-06-18T09:38:38+02:00</published><updated>2024-06-18T09:38:38+02:00</updated><id>https://stefan-marr.de/2024/06/5-reasons-for-box-plots-as-default</id><content type="html" xml:base="https://stefan-marr.de/2024/06/5-reasons-for-box-plots-as-default/"><![CDATA[<h2 id="box-plots-or-better">Box Plots, Or Better!</h2>

<p>This post is motivated by discussions I have been having for, ehm, forever?</p>

<p>To encourage others to use good research practices and avoid bar charts, I’ll argue that people should use box plots as their go-to choice when presenting performance results. Of course, box plots aren’t a one-size-fits-all solution. However, I believe they should be the preferred choice for many standard situations. For some situations, more appropriate chart types should be chosen based on careful consideration.</p>

<p>Thus, box plots should be the default choice instead of the omnipresent bar chart. Or short: Box Plots, or Better!</p>

<p>When working on performance, I usually work with just-in-time compiling language runtimes, on which I would run various experiments that I want to compare.
For examples, check the papers of <a href="https://github.com/HumphreyHCB">Humphrey</a>, <a href="https://octavelarose.github.io/">Octave</a>, and <a href="https://www.kent.ac.uk/computing/people/3165/kaleba-sophie">Sophie</a> (copies are <a href="/papers/">here</a>).
However, I believe the argument applies more generally beyond our own work.</p>

<h2 id="reason-1-performance-measurements-are-samples-from-a-distribution">Reason 1: Performance Measurements Are Samples from a Distribution</h2>

<p>When we measure the performance of a system, we usually get a data point that has been influenced by many different factors.
This is independent of whether we measure wall-clock time, the number of executed instructions, or perhaps memory.
While we can control some factors and influence others, today’s systems are often too complex for us to fully understand them. For example, cache effects, thermal properties, as well as hard- and software interactions outside our control can change performance non-deterministically. In practice, we therefore often treat the system as a black box.<sup style="font-size:60%">1</sup><span class="sidenote"><sup>1</sup> I’d encourage people to dig deeper, but I’m aware that time does not always allow for it.</span>
Treating it as a black box then of course requires us to repeat our experiments multiple times to be able to characterize the range of results that are to be expected. Statisticians would perhaps describe our measuring as “sampling a distribution”.</p>

<p>And this is the point where box plots come in. They are designed to be a convenient way to characterize distributions. Let’s assume we have an experiment A and B, and we have taken 50 measurements each. Figure 1 shows the results of our experiments as box plots.</p>

<figure class="full"><img src="/assets/2024/06/boxplot-reasons/boxplot-overview.svg" />
<figcaption>Figure 1: Box plot comparing A and B,<br />including annotations for the key elements of a box plot.</figcaption></figure>

<p>I annotated the box plot for A with some key elements, including the median, 25th, and 75th percentile. We also see the notion of an interquartile range, which tells us a bit about the shape of the result distribution and outliers, i.e., typically all measurements that are further from the 25th and 75th percentile than 1.5x the interquartile range.</p>

<p>Wikipedia has a <a href="https://en.wikipedia.org/wiki/Box_plot">good overview of box plots</a> that also goes deeper.</p>
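<p>The annotated elements of Figure 1 can also be computed directly. The following Python sketch (with made-up run times in seconds) derives the median, the 25th and 75th percentiles, the interquartile range, and the usual 1.5×IQR outlier fences:</p>

```python
import statistics

def box_plot_stats(samples):
    # The values a box plot is drawn from, plus the outlier fences.
    s = sorted(samples)
    q1, median, q3 = statistics.quantiles(s, n=4)  # 25th, 50th, 75th percentile
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [x for x in s if x < lower_fence or x > upper_fence]
    return {"median": median, "q1": q1, "q3": q3,
            "iqr": iqr, "outliers": outliers}

# Hypothetical run times in seconds for an experiment
a = [9.8, 10.2, 10.5, 10.9, 11.0, 11.3, 11.8, 12.4, 13.1, 19.0]
print(box_plot_stats(a))
```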

<h2 id="reason-2-allows-detailed-visual-comparison">Reason 2: Allows Detailed Visual Comparison</h2>

<p>With box plots, we have enough details to see that the two experiments behave differently in a number of ways.</p>

<p>The median lines tell us that A is usually faster than B. However, we also see that A is not always faster than B, because the results are further spread out. In the worst case, A takes 19 seconds, which is more than B’s worst case of 15 seconds. While the main half of all data points for both experiments don’t overlap, we see that a good chunk of A’s results still fall within what’s often not considered to be outliers, i.e., the range between the 75th percentile with 1.5x of the interquartile range added.</p>

<p>By looking at the figure and comparing these plots, I believe we can get a reasonable intuition of the performance tradeoffs of the two options.</p>

<h2 id="reason-3-box-plots-give-enough-details">Reason 3: Box Plots Give Enough Details</h2>

<p>The above analysis of the results would not have been possible for instance with a classic bar chart
as shown in Figure 2.</p>

<figure class="full"><img src="/assets/2024/06/boxplot-reasons/barchart.svg" />
<figcaption>Figure 2: Bar chart comparing A and B, showing the mean and the standard deviation as error bars.</figcaption></figure>

<p>Bar charts are often used to compare the performance of two or more systems or experiments.
However, they show only three values per bar: typically a chosen “measure of centrality” and some form of “error”. Common choices are the arithmetic mean, geometric mean, harmonic mean, and perhaps the median. Each of these has different properties, and one has to carefully think about <a href="https://dl.acm.org/doi/10.1145/5666.5673">which one to use</a> based on the type of data one is working with (<a href="https://dl.acm.org/doi/10.1145/1186736.1186738">or perhaps not</a>). At this point, one also still has to choose how to <a href="https://kar.kent.ac.uk/33611/45/p63-kaliber.pdf">characterize measurement errors</a>.</p>

<p>This means that bar charts are less standardized than box plots, and one has to be explicit about what is shown.</p>

<figure class="full"><img src="/assets/2024/06/boxplot-reasons/barchart2.svg" />
<figcaption>Figure 3: Bar chart comparing A and B, showing the median and 25th and 75th percentile.</figcaption></figure>

<p>To just give one example, Figure 3 is the same data but shows the median and the 25th and 75th percentile
instead of the mean and standard deviation.</p>

<p>Since we show different sets of statistics, our impression of the results somewhat changes.
Of course, this is the power of visualization and picking statistics.
We can draw attention to specific aspects of the data. Figure 3 would lead me to conclude that A is always better than B, while Figure 2 would make me wonder what the underlying data looks like
to understand how we got to the depicted standard deviation.</p>

<p>Compared to our box plot in Figure 1, the choice of statistics to show,
and the reduced number of details we see here can result in misleading others and ourselves.
Thus, I’d strongly argue that bar charts are neither a good default to represent data
during data analysis, nor when presenting the final insights in a paper. They show too few details, oversimplifying an often more complex story.</p>
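<p>A small Python example (with made-up run times) illustrates how much the choice of statistic changes the story a bar chart tells: a single slow run pulls the mean and standard deviation up considerably, while the median barely moves.</p>

```python
import statistics

# Hypothetical run times in seconds; one run was unusually slow.
a = [10.0, 10.1, 10.2, 10.3, 10.4, 19.0]

mean = statistics.mean(a)      # pulled towards the outlier
stdev = statistics.stdev(a)    # inflated by the outlier
median = statistics.median(a)  # barely affected
q1, _, q3 = statistics.quantiles(a, n=4)

print(f"mean ± sd: {mean:.2f} ± {stdev:.2f}")
print(f"median [q1, q3]: {median:.2f} [{q1:.2f}, {q3:.2f}]")
```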

<h2 id="reason-4-box-plots-dont-overwhelm-with-details">Reason 4: Box Plots Don’t Overwhelm With Details</h2>

<p>Of course, we could also go in the other direction and choose a plot type
that shows much more detail.</p>

<p>Let’s start with Figure 4, which shows a violin plot. I selected here a version that
shows just the density distribution of our results. One could go and highlight specific statistics on it for clarity of course.
However, just looking at Figure 4, we get a more detailed look at how our measurements are distributed.
From this, we see very clearly that B’s results are grouped much tighter together, and at each end, i.e., at 9 and 15 seconds, there are outliers.
A, on the other hand, is much more stretched out, though a good chunk of the results are indeed roughly in the area indicated by the box plot previously. What we also see here is that this area is wider and stretches from perhaps 8 to 15, outside of which we likely have significantly fewer samples.
We did not see these details in Figure 1.</p>

<figure class="full"><img src="/assets/2024/06/boxplot-reasons/violin.svg" />
<figcaption>Figure 4: A violin plot to compare A and B showing the density distribution of the results.</figcaption></figure>

<p>For data analysis, this way of looking at the data is very helpful,
because it allows us to see the underlying distribution.
For reporting data in a paper, this might however be too detailed,
in the sense that it is not as easily interpretable visually and makes drawing
conclusions harder.</p>

<figure class="full"><img src="/assets/2024/06/boxplot-reasons/combined.svg" />
<figcaption>Figure 5: A combination of violin, box plot, and raw data. The mean is indicated as a red dot.</figcaption></figure>

<p>While not ideal for final reports, violin plots are useful during analysis.
Perhaps one wants to go even a step further and use a combination of violin and box plot
together with the raw data and the mean during analysis.
An example of this is shown in Figure 5.
While the plot is very busy and, I’d think, not suitable for a paper,
it prevents us from jumping to conclusions based on data summaries.</p>

<p>If you’re analyzing your data in R, a package like <a href="https://indrajeetpatil.github.io/ggstatsplot/articles/web_only/ggbetweenstats.html">ggstatsplot</a> might be a good solution.</p>

<h2 id="reason-5-they-are-very-versatile">Reason 5: They Are Very Versatile</h2>

<p>Box plots can be used for many different purposes, independent of the type of distribution of data one wants to visualize, for different types of experiments, and to represent experimental data, as well as data summaries.</p>

<p>Because box plots visualize selected “percentile statistics”, we can use them without having to adapt them for specific experiments or types of distributions. They are nonparametric, i.e., one does not have to select any parameters for specific input data. This is useful for performance evaluations, because we do not generally know what type of distribution we are dealing with, and samples are not generally independent, which makes the use of various other statistical tools more complicated.</p>

<figure class="full"><img src="/assets/2024/06/boxplot-reasons/skewed.svg" />
<figcaption>Figure 6: Comparing experiments A, B, and C. Their data is drawn from different distributions, none of which are normal distributions. The density plot at the bottom characterizes the sample distributions more precisely.</figcaption></figure>

<p>Figure 6 shows box plots for three different distributions. Important here is that none of these experiments gives normally distributed data. Nonetheless, we can use box plots to describe them more abstractly and see certain key details, such as A being skewed to the left, B slightly less so but much narrower, and C having outliers to the left,
with a small skew to the right.</p>

<p>So far, I have used examples where there was an experiment A and B, and perhaps C.
Though, often we may want to understand the relation of perhaps two variables.
This might be in the sense of scaling a computation over multiple processor cores. Figure 7 shows a box plot that visualizes data for such a hypothetical experiment.</p>

<figure class="full"><img src="/assets/2024/06/boxplot-reasons/scaling.svg" />
<figcaption>Figure 7: Comparing A and B as they change for increasing values of a second variable from 1 to 20.</figcaption></figure>

<p>While one would often use line charts for such scaling experiments, box plots can be used here as well.
One can still see the rough shape of a line, but we do not lose sight of the distribution of our experimental data.
Arguably, Figure 7 is very busy though, and a line chart with a confidence interval or similar would look better
(<a href="https://www.pythoncharts.com/python/line-chart-with-confidence-interval/">Python</a>, <a href="https://typethepipe.com/vizs-and-tips/ggplot-geom_ribbon-shadow-confidence-interval/">ggplot</a>).</p>

<p>We can also see that box plots “scale” reasonably well themselves in the sense
that they work for data that is spread out as well as for data that is very closely grouped.
For example, for B, the values at x-axis point 1 are very narrowly grouped.
Similarly for A, the data at 20 is tightly grouped.
In either case, we still have the complete power of the box plot and can draw conclusions.</p>

<p>If we would now want to summarize these results, we can of course use box plots!</p>

<figure class="full"><img src="/assets/2024/06/boxplot-reasons/summarize.svg" />
<figcaption>Figure 8: Summary of the data of Figure 7. A box plot of a box plot. In the upper half, it's a box plot over the medians. In the lower half, it's a box plot over all raw data.</figcaption></figure>

<p>Figure 8 shows a summary. The plot at the top uses the medians for each of the experiments
over the variable that went from 1 to 20. So, for A and B, we have 20 values each, and plot them as box plots.
Note, for this to be a valid statistic, technically the medians have to be derived from independent samples, so, you may need to consult your friendly neighborhood statistician.</p>

<p>In the bottom plot, I used the raw data of all experiments.
In a way, this still “works”, and results in very similar box plots in this case.
Though, here the meaning changes and of course whether you can do this with your data is something you need to ask your statistician about. I think, common wisdom in our field is to first normalize the data and then “bootstrap” it. This would give us bootstrapped medians etc. The median is then technically from a normal distribution of independent samples, and standard statistics are legal again.</p>
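<p>The bootstrapping step can be sketched in a few lines of Python (the data, resample count, and seed are made up; this shows the plain percentile bootstrap of the median, not a recommendation of a specific method):</p>

```python
import random
import statistics

def bootstrap_medians(samples, n_resamples=1000, seed=42):
    # Resample with replacement and collect the median of each resample.
    # The resulting bootstrap distribution is approximately normal, so
    # standard statistics such as confidence intervals apply again.
    rng = random.Random(seed)
    n = len(samples)
    return [statistics.median(rng.choices(samples, k=n))
            for _ in range(n_resamples)]

data = [9.8, 10.2, 10.5, 10.9, 11.0, 11.3, 11.8, 12.4, 13.1, 19.0]
meds = sorted(bootstrap_medians(data))

# 95% percentile confidence interval for the median
lo, hi = meds[int(0.025 * len(meds))], meds[int(0.975 * len(meds)) - 1]
print(f"median ~ {statistics.median(meds):.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```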

<h2 id="conclusion-box-plots-answer-important-questions-at-a-glance-use-box-plots-or-better">Conclusion: Box Plots Answer Important Questions At A Glance. Use Box Plots, Or Better!</h2>

<p>When it comes to writing academic papers, I do believe that box plots are a much better default choice for communicating performance results than bar charts are.</p>

<p>The key reasons for me are:</p>

<ul>
  <li>they are a concise representation of the result distribution</li>
  <li>they allow a visual comparison of more than the most basic statistics</li>
  <li>and thus, answer more questions than bar charts</li>
  <li>but without making things too complicated</li>
  <li>they are also more standardized, and thus, remain more readable when taken out of context</li>
  <li>and they can be used sensibly for a wide range of use cases</li>
</ul>

<p>So, for me box plots strike a good overall balance, which makes them a good standard choice for papers.</p>

<p>Though, as mentioned earlier, they are not a universally best choice either.
For data analysis, one would want more details, and for specific use cases
or types of data distributions, e.g., bi- or multi-modal distribution,
other types of plots are more suitable.
I can recommend this piece with <a href="https://nightingaledvs.com/ive-stopped-using-box-plots-should-you/">many examples where other types of plots than box plots may be better choices</a>.</p>

<p>For questions, comments, or suggestions, please find me on Twitter <a href="https://x.com/smarr/status/1802985144787026371">@smarr</a> or <a href="https://mastodon.acm.org/@smarr/112636733344807717">Mastodon</a>.</p>]]></content><author><name></name></author><category term="Research" /><category term="paper" /><category term="Research" /><category term="Graphs" /><category term="Statistics" /><category term="Evaluation" /><category term="R" /><summary type="html"><![CDATA[Box Plots, Or Better!]]></summary></entry><entry><title type="html">Why Are My Bytecode Interpreters Slow? Hunting Truffles with VTune</title><link href="https://stefan-marr.de/2024/02/why-are-my-bytecode-interpreters-slow/" rel="alternate" type="text/html" title="Why Are My Bytecode Interpreters Slow? Hunting Truffles with VTune" /><published>2024-02-23T13:27:55+01:00</published><updated>2024-02-23T13:27:55+01:00</updated><id>https://stefan-marr.de/2024/02/why-are-my-bytecode-interpreters-slow</id><content type="html" xml:base="https://stefan-marr.de/2024/02/why-are-my-bytecode-interpreters-slow/"><![CDATA[<p>As part of our work on the <a href="https://stefan-marr.de/papers/oopsla-larose-et-al-ast-vs-bytecode-interpreters-in-the-age-of-meta-compilation/">AST vs. Bytecode Interpreters</a> paper,
I briefly looked at what the native code of ahead-of-time-compiled bytecode loops looks
like, but except for finding much more code than I expected,
I didn’t look too closely at what was going on.</p>

<p>After the paper was completed, I went on to make big plans, falling into the
old trap of simply assuming that I roughly knew where the interpreters spend
their time, and based on these assumptions developed grand ideas to make them faster.
None of these ideas would be easy of course, possibly requiring months of work.
Talking to my colleagues working on the Graal compiler,
I was very kindly reminded that I should <strong>know and not guess</strong> where the execution time goes.
I remember hearing that before, probably when I told my students the same thing…</p>

<p>So, here we are. This blog post has two goals: to document, for future me, how to build the various Truffle interpreters and how to use VTune; and to discuss a bit where Truffle bytecode interpreters spend much of their time.</p>

<p>To avoid focusing on implementation-specific aspects, I’ll look at four different
Truffle-based bytecode interpreters.
I’ll look at my own trusty TruffleSOM, <a href="https://github.com/oracle/graal/tree/master/espresso#java-on-truffle-coffee">Espresso</a> (a JVM on Truffle), <a href="https://github.com/oracle/graalpython/#graalpy-the-graalvm-implementation-of-python">GraalPy</a>,
and <a href="https://github.com/oracle/graal/tree/master/wasm#graalwasm">GraalWasm</a>.</p>

<p>To keep things brief, below I’ll just give the currently working magic incantations
that produce ahead-of-time-compiled binaries for the bytecode interpreters,
as well as how to run them as pure interpreters using the classic <a href="https://github.com/smarr/are-we-fast-yet/tree/master#micro-benchmarks">Mandelbrot</a> benchmark.</p>

<h2 id="building-the-interpreters">Building the Interpreters</h2>

<p>Let’s start with TruffleSOM. The following is the command line to build all dependencies and the bytecode interpreter. It also disables just-in-time compilation at build time,
something the other implementations control via a command-line flag.
The result is a binary in the same folder.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>TruffleSOM<span class="nv">$ </span>mx build-native <span class="nt">--build-native-image-tool</span> <span class="nt">--build-trufflesom</span> <span class="nt">--no-jit</span> <span class="nt">--type</span> BC
</code></pre></div></div>

<p><a href="https://github.com/oracle/graal/espresso/">Espresso</a> is part of the Graal repository
and all necessary build settings are conveniently maintained as part of the
repository. To find the folder with the binary, we can use the second command:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>espresso<span class="nv">$ </span>mx <span class="nt">--env</span> native-ce build
espresso<span class="nv">$ </span>mx <span class="nt">--env</span> native-ce graalvm-home
</code></pre></div></div>

<p><a href="https://github.com/oracle/graalpython/">GraalPy</a>
is in its own repository, though can also be built similarly.
It also prints conveniently the path where we find the result.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>graalpy<span class="nv">$ </span>mx python-svm
</code></pre></div></div>

<p>Last but not least, GraalWasm takes a little more convincing to get the same
result. Here the configuration isn’t included in the repository.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wasm<span class="nv">$ </span><span class="nb">export </span><span class="nv">DYNAMIC_IMPORTS</span><span class="o">=</span>/substratevm,/sdk,/truffle,/compiler,/wasm,/tools
wasm<span class="nv">$ </span><span class="nb">export </span><span class="nv">COMPONENTS</span><span class="o">=</span>cmp,cov,dap,gvm,gwa,ins,insight,insightheap,lg,lsp,pro,sdk,sdkl,tfl,tfla,tflc,tflm
wasm<span class="nv">$ </span><span class="nb">export </span><span class="nv">NATIVE_IMAGES</span><span class="o">=</span>lib:jvmcicompiler,lib:wasmvm
wasm<span class="nv">$ </span><span class="nb">export </span><span class="nv">NON_REBUILDABLE_IMAGES</span><span class="o">=</span>lib:jvmcicompiler
wasm<span class="nv">$ </span><span class="nb">export </span><span class="nv">DISABLE_INSTALLABLES</span><span class="o">=</span>False
wasm<span class="nv">$ </span>mx build
wasm<span class="nv">$ </span>mx graalvm-home
</code></pre></div></div>

<p>At this point, we have our four interpreters ready for use.</p>

<h2 id="building-the-benchmarks">Building the Benchmarks</h2>

<p>As benchmark, I’ll use the Are We Fast Yet version of Mandelbrot.
This is mostly for my own convenience. For this experiment, we just need a benchmark
that mostly runs inside the bytecode loop, and Mandelbrot does a good-enough job at that.</p>

<p>TruffleSOM and Python take care of compilation implicitly, but for Java and Wasm,
we need to produce the jar and wasm files ourselves. For Wasm, I used the C++ version of the benchmark
and Emscripten to compile it.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Java<span class="nv">$ </span> ant jar   <span class="c"># creates a benchmarks.jar</span>
C++<span class="nv">$ </span>  <span class="nv">CXX</span><span class="o">=</span>em++ <span class="nv">OPT</span><span class="o">=</span><span class="s1">'-O3 -sSTANDALONE_WASM'</span> build.sh
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">-sSTANDALONE_WASM</code> flag makes sure Emscripten gives us a wasm module
that works without further issues on GraalWasm.</p>

<h2 id="running-the-benchmarks">Running the Benchmarks</h2>

<p>Executing the Mandelbrot benchmark is now relatively straightforward.
Though, I’ll skip over the full path details below.
For Espresso, GraalPy, and GraalWasm, we use the command-line flags to disable
just-in-time compilation as follows.</p>

<p>Note the executables are the ones we built above. Running for instance
<code class="language-plaintext highlighter-rouge">java --version</code> should show something like the following:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>openjdk 21.0.2 2024-01-16
OpenJDK Runtime Environment GraalVM CE 21.0.2-dev+13.1 (build 21.0.2+13-jvmci-23.1-b33)
Espresso 64-Bit VM GraalVM CE 21.0.2-dev+13.1 (build 21-espresso-24.1.0-dev, mixed mode)
</code></pre></div></div>

<p>With this, running the benchmarks uses roughly the following commands:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>som-native-interp-bc <span class="nt">-cp</span> Smalltalk Harness.som Mandelbrot 10 500
java    <span class="nt">--experimental-options</span> <span class="nt">--engine</span>.Compilation<span class="o">=</span><span class="nb">false</span> <span class="nt">-cp</span> benchmarks.jar Harness Mandelbrot 10 500
graalpy <span class="nt">--experimental-options</span> <span class="nt">--engine</span>.Compilation<span class="o">=</span><span class="nb">false</span> ./harness.py Mandelbrot 10 500
wasm    <span class="nt">--experimental-options</span> <span class="nt">--engine</span>.Compilation<span class="o">=</span><span class="nb">false</span> ./harness-em++-O3-sSTANDALONE_WASM Mandelbrot 10 500
</code></pre></div></div>

<h2 id="using-vtune">Using VTune</h2>

<p>There are various useful profilers out there, but my colleagues specifically
asked me to have a look at VTune, and I figured it might be a convenient way
to grab various hardware details from an Intel CPU.</p>

<p>However, I do not have direct access to an Intel workstation.
So, instead of using the VTune desktop user interface or command line,
I’ll actually use the VTune server on one of our benchmarking machines.
This was surprisingly convenient and seems useful
for rerunning previous experiments with different settings or binaries.</p>

<p>The machine is suitably protected, but I can’t recommend using the following
in an open environment:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vtune-server <span class="nt">--web-port</span> <span class="nv">$PORT</span> <span class="nt">--enable-server-profiling</span> <span class="nt">--reset-passphrase</span>
</code></pre></div></div>

<p>This prints the URL at which the web interface can be accessed. The server is configured
so that we can run experiments directly from the interface, which helps
with finding the various interesting options.</p>

<p>For all four interpreters, I’ll focus on what VTune calls the Hotspots profiling
runs. I used the <em>Hardware Event-Based Sampling</em> setting with additional
performance insights.</p>

<p>After running the benchmark,
VTune opens a <em>Summary</em> of the results with various statistics.
For this investigation, though, the most interesting part is the Bottom-up view of where the program spent its time.
For all four interpreters, the top function is the bytecode loop.</p>

<p>Opening the top function allows us to view the assembly,
and group by <em>Basic Block / Address</em>. This neatly adds up the time of each instruction in a basic block, and gives us an impression of how much time we spent in each block.</p>

<h2 id="the-bytecode-dispatch">The Bytecode Dispatch</h2>

<p>VTune gives us a convenient way to identify which parts of the compiled code
are executed and how much time we spend in them.
What surprised me is that about 50% of all time is spent in bytecode dispatch.
Not the bytecode operation itself, no, but the code executed for every single
bytecode leading up to and including the dispatch.</p>

<p>Below is the full code of the “bytecode dispatch” for GraalWasm.
As far as I can see, all four interpreters have roughly the same native code
structure. It starts with a very long sequence of instructions that likely
read various bits out of Truffle’s <code class="language-plaintext highlighter-rouge">VirtualFrame</code> objects, and then proceeds
to do the actual dispatch via what the Graal compiler calls an <code class="language-plaintext highlighter-rouge">IntegerSwitchNode</code> in its intermediate representation,
for which the <code class="language-plaintext highlighter-rouge">RangeTableSwitchOp</code> strategy is used for compilation.
This encodes the bytecode dispatch by looking up the
jump target in a table, and then performing the <code class="language-plaintext highlighter-rouge">jmp %rdx</code> instruction at the
very end of the code below.</p>

<style>
.full { width: 90% }
.full td:nth-child(3) { text-align: right }
</style>

<!-- prettier-ignore-start -->

<table class="full">
  <thead>
    <tr>
      <th>Address</th>
      <th>Assembly</th>
      <th>CPU Time</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td> </td>
      <td><strong>Block 18</strong></td>
      <td><strong>11.963s</strong></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b229</code></td>
      <td><code class="language-plaintext highlighter-rouge">xorpd %xmm0, %xmm0</code></td>
      <td>0.306s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b22d</code></td>
      <td><code class="language-plaintext highlighter-rouge">mov $0xfffffffffffff, %rdx</code></td>
      <td>0.486s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b237</code></td>
      <td><code class="language-plaintext highlighter-rouge">movq %rdx, 0x1f0(%rsp)</code></td>
      <td>0.080s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b23f</code></td>
      <td><code class="language-plaintext highlighter-rouge">mov $0xeae450, %rdx</code></td>
      <td>0.070s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b249</code></td>
      <td><code class="language-plaintext highlighter-rouge">mov %r8d, %ebx</code></td>
      <td>0.346s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b24c</code></td>
      <td><code class="language-plaintext highlighter-rouge">movzxb 0x10(%r10,%rbx,1), %ebx</code></td>
      <td>0.391s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b252</code></td>
      <td><code class="language-plaintext highlighter-rouge">movq 0x18(%rdi), %rbp</code></td>
      <td>0.050s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b256</code></td>
      <td><code class="language-plaintext highlighter-rouge">movq %rbp, 0xd0(%rsp)</code></td>
      <td>0.030s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b25e</code></td>
      <td><code class="language-plaintext highlighter-rouge">movq 0x10(%rdi), %rbp</code></td>
      <td>0.256s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b262</code></td>
      <td><code class="language-plaintext highlighter-rouge">movq %rbp, 0xc8(%rsp)</code></td>
      <td>0.441s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b26a</code></td>
      <td><code class="language-plaintext highlighter-rouge">lea (%r14,%rbp,1), %rsi</code></td>
      <td>0.256s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b26e</code></td>
      <td><code class="language-plaintext highlighter-rouge">movq %rsi, 0xc0(%rsp)</code></td>
      <td>0.020s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b276</code></td>
      <td><code class="language-plaintext highlighter-rouge">lea 0xe(%r8), %esi</code></td>
      <td>0.611s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b27a</code></td>
      <td><code class="language-plaintext highlighter-rouge">lea 0xd(%r8), %ebp</code></td>
      <td>0.356s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b27e</code></td>
      <td><code class="language-plaintext highlighter-rouge">lea 0xc(%r8), %edi</code></td>
      <td>0.135s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b282</code></td>
      <td><code class="language-plaintext highlighter-rouge">lea 0xb(%r8), %r11d</code></td>
      <td>0.055s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b286</code></td>
      <td><code class="language-plaintext highlighter-rouge">lea 0xa(%r8), %r13d</code></td>
      <td>0.306s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b28a</code></td>
      <td><code class="language-plaintext highlighter-rouge">movl %r13d, 0x1ec(%rsp)</code></td>
      <td>0.401s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b292</code></td>
      <td><code class="language-plaintext highlighter-rouge">lea 0x9(%r8), %r12d</code></td>
      <td>0.125s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b296</code></td>
      <td><code class="language-plaintext highlighter-rouge">movl %r12d, 0x1e8(%rsp)</code></td>
      <td>0.065s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b29e</code></td>
      <td><code class="language-plaintext highlighter-rouge">lea 0x8(%r8), %edx</code></td>
      <td>0.326s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b2a2</code></td>
      <td><code class="language-plaintext highlighter-rouge">lea 0x7(%r8), %ecx</code></td>
      <td>0.416s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b2a6</code></td>
      <td><code class="language-plaintext highlighter-rouge">movl %esi, 0x1e4(%rsp)</code></td>
      <td>0.115s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b2ad</code></td>
      <td><code class="language-plaintext highlighter-rouge">lea 0x6(%r8), %esi</code></td>
      <td>0.030s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b2b1</code></td>
      <td><code class="language-plaintext highlighter-rouge">movl %ebp, 0x1e0(%rsp)</code></td>
      <td>0.311s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b2b8</code></td>
      <td><code class="language-plaintext highlighter-rouge">lea 0x5(%r8), %ebp</code></td>
      <td>0.356s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b2bc</code></td>
      <td><code class="language-plaintext highlighter-rouge">movl %edi, 0x1dc(%rsp)</code></td>
      <td>0.150s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b2c3</code></td>
      <td><code class="language-plaintext highlighter-rouge">lea 0x4(%r8), %edi</code></td>
      <td>0.040s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b2c7</code></td>
      <td><code class="language-plaintext highlighter-rouge">movl %r11d, 0x1d8(%rsp)</code></td>
      <td>0.321s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b2cf</code></td>
      <td><code class="language-plaintext highlighter-rouge">lea 0x3(%r8), %r11d</code></td>
      <td>0.416s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b2d3</code></td>
      <td><code class="language-plaintext highlighter-rouge">movl %r11d, 0x1d4(%rsp)</code></td>
      <td>0.135s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b2db</code></td>
      <td><code class="language-plaintext highlighter-rouge">lea 0x2(%r8), %r13d</code></td>
      <td>0.065s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b2df</code></td>
      <td><code class="language-plaintext highlighter-rouge">movl %r13d, 0x1d0(%rsp)</code></td>
      <td>0.276s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b2e7</code></td>
      <td><code class="language-plaintext highlighter-rouge">mov %r8d, %r12d</code></td>
      <td>0.426s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b2ea</code></td>
      <td><code class="language-plaintext highlighter-rouge">inc %r12d</code></td>
      <td>0.125s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b2ed</code></td>
      <td><code class="language-plaintext highlighter-rouge">movl %r12d, 0x1cc(%rsp)</code></td>
      <td>0.045s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b2f5</code></td>
      <td><code class="language-plaintext highlighter-rouge">movl %edx, 0x1c8(%rsp)</code></td>
      <td>0.321s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b2fc</code></td>
      <td><code class="language-plaintext highlighter-rouge">mov %r9d, %edx</code></td>
      <td>0.441s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b2ff</code></td>
      <td><code class="language-plaintext highlighter-rouge">inc %edx</code></td>
      <td>0.120s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b301</code></td>
      <td><code class="language-plaintext highlighter-rouge">movl %edx, 0x1c4(%rsp)</code></td>
      <td>0.040s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b308</code></td>
      <td><code class="language-plaintext highlighter-rouge">lea -0x3(%r9), %edx</code></td>
      <td>0.366s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b30c</code></td>
      <td><code class="language-plaintext highlighter-rouge">movl %edx, 0x1c0(%rsp)</code></td>
      <td>0.311s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b313</code></td>
      <td><code class="language-plaintext highlighter-rouge">lea -0x2(%r9), %edx</code></td>
      <td>0.221s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b317</code></td>
      <td><code class="language-plaintext highlighter-rouge">movl %edx, 0x1bc(%rsp)</code></td>
      <td>0.045s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b31e</code></td>
      <td><code class="language-plaintext highlighter-rouge">lea 0xf(%r8), %edx</code></td>
      <td>0.376s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b322</code></td>
      <td><code class="language-plaintext highlighter-rouge">mov %r9d, %r8d</code></td>
      <td>0.366s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b325</code></td>
      <td><code class="language-plaintext highlighter-rouge">dec %r8d</code></td>
      <td>0.085s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b328</code></td>
      <td><code class="language-plaintext highlighter-rouge">movl %r8d, 0x1b8(%rsp)</code></td>
      <td>0.045s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b330</code></td>
      <td><code class="language-plaintext highlighter-rouge">movl %edx, 0x1b4(%rsp)</code></td>
      <td>0.371s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b337</code></td>
      <td><code class="language-plaintext highlighter-rouge">mov %ebx, %r9d</code></td>
      <td>0.481s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b33a</code></td>
      <td><code class="language-plaintext highlighter-rouge">cmp $0xfe, %r9d</code></td>
      <td>0.035s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b341</code></td>
      <td><code class="language-plaintext highlighter-rouge">jnbe 0x1f7c658</code></td>
      <td> </td>
    </tr>
    <tr>
      <td> </td>
      <td> </td>
      <td> </td>
    </tr>
    <tr>
      <td> </td>
      <td><strong>Block 19</strong></td>
      <td><strong>0.982s</strong></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b347</code></td>
      <td><code class="language-plaintext highlighter-rouge">lea 0xa(%rip), %rdx</code></td>
      <td>0.035s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b34e</code></td>
      <td><code class="language-plaintext highlighter-rouge">movsxdl (%rdx,%r9,4), %r9</code></td>
      <td>0.321s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b352</code></td>
      <td><code class="language-plaintext highlighter-rouge">add %r9, %rdx</code></td>
      <td>0.506s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b355</code></td>
      <td><code class="language-plaintext highlighter-rouge">jmp %rdx</code></td>
      <td>0.120s</td>
    </tr>
  </tbody>
</table>

<!-- prettier-ignore-end -->

<p>Someone who writes bytecode interpreters directly in assembly might be mortified by this code.
Though, to me this is more of an artifact of some missed optimization opportunities in the otherwise excellent Graal compiler,
which hopefully can be fixed.</p>
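<p>For contrast, the source-level pattern that this machine code implements is nothing more than a <code class="language-plaintext highlighter-rouge">switch</code> inside a loop. A minimal, hypothetical sketch, with illustrative names, not the code of any of the four interpreters:</p>

```java
// Hypothetical stack-machine interpreter. The switch is what Graal
// represents as an IntegerSwitchNode and compiles with the
// RangeTableSwitchOp strategy: a range check, a table lookup,
// and an indirect jump.
final class MiniInterp {
  static final byte PUSH = 0, ADD = 1, HALT = 2;

  static long run(byte[] bytecode) {
    long[] stack = new long[16];
    int sp = 0;
    int ip = 0;
    while (true) {
      byte op = bytecode[ip++]; // fetch the next bytecode
      switch (op) {             // dispatch on it
        case PUSH: stack[sp++] = bytecode[ip++]; break;
        case ADD:  stack[sp - 2] += stack[sp - 1]; sp--; break;
        case HALT: return stack[sp - 1];
        default: throw new IllegalStateException("bad bytecode " + op);
      }
    }
  }
}
```

In the GraalWasm listing above, Block 19 is the table lookup and indirect jump for such a <code class="language-plaintext highlighter-rouge">switch</code>; Block 18 is everything leading up to it.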

<p>I won’t include the results for the other interpreters here,
but to summarize, let’s count the instructions of the bytecode dispatch for each
of them:</p>

<ul>
  <li>TruffleSOM: 31 instructions in 1 basic block (after some extra optimization)</li>
  <li>Espresso: 79 instructions in 2 basic blocks</li>
  <li>GraalPy: 81 instructions in 3 basic blocks</li>
  <li>GraalWasm: 56 instructions in 2 basic blocks</li>
</ul>

<p>For GraalPy, it is even a little more complex. There are various other
basic blocks involved, and none of the bytecode handlers jumps back directly
to the same block. Instead, there seems to be some more code after each handler
before it jumps back to the top of the loop.</p>

<h2 id="the-first-micro-optimization-opportunity">The First Micro-Optimization Opportunity</h2>

<p>As mentioned, for TruffleSOM I already looked into one optimization opportunity.
The very careful reader might have noticed the end of Block 18 above.</p>

<!-- prettier-ignore-start -->

<table class="full">
  <thead>
    <tr>
      <th>Address</th>
      <th>Assembly</th>
      <th>CPU Time</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b33a</code></td>
      <td><code class="language-plaintext highlighter-rouge">cmp $0xfe, %r9d</code></td>
      <td>0.035s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0x1f5b341</code></td>
      <td><code class="language-plaintext highlighter-rouge">jnbe 0x1f7c658</code></td>
      <td> </td>
    </tr>
  </tbody>
</table>

<!-- prettier-ignore-end -->

<p>This is a correctness check for the <code class="language-plaintext highlighter-rouge">switch</code>/<code class="language-plaintext highlighter-rouge">case</code> statement in the Java code.
It makes sure that the value we switch over is covered by the cases in the
switch. Otherwise, we’re jumping to some default block, or just back to the
top of the loop.</p>

<p>The implementation in Graal is a little bit too hard-coded for my taste.
While it has all the logic to eliminate unreachable cases, for instance
when it sees that the value we switch over is guaranteed to exclude some cases,
it handles the default case directly in the machine-specific lowering code.</p>
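<p>The kind of case elimination mentioned above can be provoked at the source level: when the switched-over value is masked, the compiler can prove its range and treat the cases as exhaustive. Whether the range check actually disappears in a particular situation is something to verify in the emitted assembly; this is merely a hypothetical illustration:</p>

```java
// Hypothetical example: masking with 0x3 proves the value is in 0..3,
// so a compiler with value-range analysis can see the cases as
// exhaustive. Without such a guarantee, the compiled switch keeps a
// range check plus a jump to the default block, as seen above.
final class SwitchRange {
  static int dispatch(int op) {
    switch (op & 0x3) {
      case 0: return 10;
      case 1: return 20;
      case 2: return 30;
      default: return 40; // only reachable for (op & 0x3) == 3
    }
  }
}
```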

<p>It seems like one could generalize this a little more and possibly handle the default case like the other cases.
The relevant Graal issue for this micro-optimization is <a href="https://github.com/oracle/graal/issues/8425">#8425</a>.
However, when I applied this to TruffleSOM’s version of Graal and eliminated those two
instructions, it didn’t make a difference. The remaining 31 instructions still dominate the bytecode dispatch.</p>

<h2 id="conclusion">Conclusion</h2>

<p>The most important takeaway message here is of course <strong>know, don’t assume</strong>,
or more specifically <strong>measure, don’t guess</strong>.</p>

<p>For the problem at hand, it looks like Graal struggles with hoisting some
common reads out of the bytecode loops. If there’s a way to fix this, this could give a massive speedup to all Truffle-based bytecode interpreters, perhaps enough to invalidate our <a href="https://stefan-marr.de/papers/oopsla-larose-et-al-ast-vs-bytecode-interpreters-in-the-age-of-meta-compilation/">AST vs. Bytecode Interpreters</a> paper.
Wouldn’t that be fun?! 🤓 🧑🏻‍🔬</p>

<p>The mentioned micro-optimization would also avoid a few instructions for every <code class="language-plaintext highlighter-rouge">switch</code>/<code class="language-plaintext highlighter-rouge">case</code> in normal Java code that doesn’t need the default case. So, it might be relevant for more than just bytecode interpreters.</p>

<p>For questions, comments, or suggestions, please find me on Twitter <a href="https://twitter.com/smarr/status/1761420636667220338">@smarr</a> or <a href="https://mastodon.acm.org/@smarr/111987286893614873">Mastodon</a>.</p>

<p><a name="pysom"></a></p>

<h2 id="addendum-dispatch-code-for-pysoms-rpython-based-bytecode-interpreter">Addendum: Dispatch Code for PySOM’s RPython-based Bytecode Interpreter</h2>

<p>After turning my notes into this blog post yesterday, I figured today I should
also look at what RPython is doing for my PySOM bytecode interpreter.</p>

<p>The 13 instructions below are the bytecode dispatch
for the interpreter. While it is much shorter, it also contains the
safety check <code class="language-plaintext highlighter-rouge">cmp $0x45, %al</code> to make sure the bytecode is within the set of
targets. Ideally, a bytecode verifier would have ensured that already.
Or perhaps we could set up the native code so that there are simply no unsafe jump
targets, avoiding the check entirely; at least based on VTune, it
seems to consume a considerable amount of the overall run time.
<del>Also somewhat concerning is that the check is done twice.
Block 21 already has a <code class="language-plaintext highlighter-rouge">cmp $0x45, %rax</code>, which should make the second test
unnecessary.</del> (Correction: the second check was unrelated, and <a href="https://github.com/SOM-st/PySOM/pull/56">I managed to remove it</a> by applying an optimization I already had in TruffleSOM, but not yet in PySOM.)</p>

<p>So, yeah, I guess PySOM could be a bit faster on every single bytecode,
which might mean PyPy could possibly also improve its interpreted performance.</p>

<table>
  <thead>
    <tr>
      <th>Address</th>
      <th>Assembly</th>
      <th>CPU Time</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td> </td>
      <td><strong>Block 20</strong></td>
      <td>0.085s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0xb45d8</code></td>
      <td><code class="language-plaintext highlighter-rouge">cmp %r14, %rcx</code></td>
      <td>0.085s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0xb45db</code></td>
      <td><code class="language-plaintext highlighter-rouge">jle 0xb6ba0</code></td>
      <td> </td>
    </tr>
    <tr>
      <td> </td>
      <td><strong>Block 21</strong></td>
      <td>0.631s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0xb45e1</code></td>
      <td><code class="language-plaintext highlighter-rouge">movzxb 0x10(%rdx,%r14,1), %eax</code></td>
      <td>0.311s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0xb45e7</code></td>
      <td><code class="language-plaintext highlighter-rouge">cmp $0x45, %rax</code></td>
      <td>0.321s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0xb45eb</code></td>
      <td><code class="language-plaintext highlighter-rouge">jnle 0xb6c00</code></td>
      <td> </td>
    </tr>
    <tr>
      <td> </td>
      <td><strong>Block 22</strong></td>
      <td>4.601s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0xb45f1</code></td>
      <td><code class="language-plaintext highlighter-rouge">lea 0x69868(%rip), %rdi</code></td>
      <td>0.571s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0xb45f8</code></td>
      <td><code class="language-plaintext highlighter-rouge">movq 0x10(%rdi,%rax,8), %r15</code></td>
      <td>0.020s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0xb45fd</code></td>
      <td><code class="language-plaintext highlighter-rouge">add %r14, %r15</code></td>
      <td>2.942s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0xb4600</code></td>
      <td><code class="language-plaintext highlighter-rouge">cmp $0x45, %al</code></td>
      <td>1.068s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0xb4602</code></td>
      <td><code class="language-plaintext highlighter-rouge">jnbe 0xb6c59</code></td>
      <td> </td>
    </tr>
    <tr>
      <td> </td>
      <td><strong>Block 23</strong></td>
      <td>0.476s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0xb4608</code></td>
      <td><code class="language-plaintext highlighter-rouge">movsxdl (%rbx,%rax,4), %rax</code></td>
      <td>0s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0xb460c</code></td>
      <td><code class="language-plaintext highlighter-rouge">add %rbx, %rax</code></td>
      <td>0.015s</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0xb460f</code></td>
      <td><code class="language-plaintext highlighter-rouge">jmp %rax</code></td>
      <td>0.461s</td>
    </tr>
  </tbody>
</table>]]></content><author><name></name></author><category term="Research" /><category term="Truffle" /><category term="GraalVM" /><category term="Performance" /><category term="Profiling" /><category term="SOM" /><category term="Truffle" /><category term="GraalPy" /><category term="Espresso" /><summary type="html"><![CDATA[As part of our work on the AST vs. Bytecode Interpreters paper, I briefly looked at how the native code of ahead-of-time-compiled bytecode loops looks like, but except for finding much more code than what I expected, I didn’t look too closely at what was going on.]]></summary></entry></feed>