Irrationally Annoyed: The SIGPLAN Blog Post Writing 30 Years of PL Research Out of Existence

I started writing this post while being very, very annoyed by this blog post on the SIGPLAN blog. I could not understand how “THE SIGPLAN” blog could simply write 30 years of programming language research out of existence, only barely acknowledging Self and JavaScript. It felt like duty called.

Though, after multiple rounds of edits, I am now assuming that there’s simply an issue of awareness. What follows is therefore an attempt to raise awareness of the various academic and industrial efforts around the performance of dynamic languages.

Disclaimer: this blog post isn’t attempting to be complete in any way, but I hope it provides enough pointers for people interested in language implementation technology for dynamic languages to dig deeper and possibly even contribute.

Languages vs. Language Implementations

Before going into technical details, it is important to start by distinguishing between a language and its implementation. Especially when we talk about programming language research, the two are rarely the same.

While the languages mentioned in the post (Python, R, PHP) all have “standard” implementations, which are by far the most widely used ones, they are far from the only implementations. Let’s just pick three for each language: there are PyPy, IronPython, and Jython for Python; pqR, Renjin, and FastR for R; as well as HippyVM, PeachPie, and Quercus for PHP.

They all have benefits and drawbacks, and perhaps they are not even 100% compatible, but various groups of people clearly spent a lot of time and energy on more optimized language implementations. Some of these implementations have state-of-the-art just-in-time compilers and can reach performance competitive with JavaScript, which itself is certainly competitive with, for instance, Java, and only slightly behind “metal” languages on some benchmarks, and ahead on others.

How do I know? I didn’t benchmark all of these language implementations, but I worked on comparing compilers across languages with the Are We Fast Yet project [5]. One important insight here is that language design can definitely have a performance impact. But the ingenuity of compiler writers and language implementers has eliminated all major performance hurdles over the last 30 years.

This means that claims about the languages themselves, for instance that they are “all notoriously slow and bloated” and “not designed to be fast or space-efficient”, are rather misguided.

Just-in-time compilation technology as used for JavaScript, Java, Python, and others is very effective, and not particularly language-specific. Some of the key techniques have been known for 30 years, as I discussed elsewhere. Others, such as storage strategies, mementos, and partial escape analysis, are much more recent and help with many issues. Perhaps interesting to note is that storage strategies were actually first done in PyPy, and are now pretty much standard in just-in-time compiling VMs for a whole range of languages, including JavaScript.
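
To make the idea of storage strategies a bit more concrete, here is a minimal sketch in Python. It only illustrates the concept, under the assumption that a compact, unboxed representation is kept as long as a collection stays homogeneous; it is not how PyPy or any real VM implements it, and all names are made up.

# Minimal sketch of the storage-strategy idea (not PyPy's actual implementation):
# a list keeps its elements unboxed while they are all integers, and only
# falls back to a generic boxed representation when that assumption breaks.
from array import array

class StrategyList:
    def __init__(self):
        self._ints = array('q')   # unboxed 64-bit integer storage
        self._generic = None      # boxed storage, used only after generalizing

    def append(self, value):
        if self._generic is None:
            if isinstance(value, int):
                self._ints.append(value)   # stay on the compact strategy
                return
            # first non-integer: switch strategies by boxing existing elements
            self._generic = list(self._ints)
        self._generic.append(value)

    def __getitem__(self, index):
        storage = self._generic if self._generic is not None else self._ints
        return storage[index]

xs = StrategyList()
for i in range(5):
    xs.append(i)        # stays unboxed
xs.append("done")       # generalizes to boxed storage
print(xs[2], xs[5])     # 2 done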

Though, one may argue that these techniques have not been successful enough, because people don’t use the language implementations that actually benefit from them.

Fine, but I believe this is only partially a technical problem. Some of it is a social issue, too.

Others have spent time identifying potential reasons why people do not pick up our research. For instance, Laurie wrote blog posts on why our VMs have not made more users happy [1, 2].

Given the claim that JavaScript is somehow exceptional, I would argue that JavaScript was lucky in a way. JavaScript was hilariously slow to start with. For V8, this meant that they didn’t have to worry about making “startup performance” worse. Their baseline native-code compiler produced code that was much better than the JavaScript interpreters back then. Perhaps unfortunately, Python has a surprisingly fast interpreter, which gives VMs such as PyPy quite a challenge. Still, it is not impossible to beat such interpreters and have just-in-time compiling VMs that also have good startup performance. Examples are indeed modern JavaScript VMs, but also much smaller and less well-funded projects such as LuaJIT, which has an incredibly fast interpreter and a very effective just-in-time compiler. But indeed, it takes time and effort to build these systems, and contrary to the claims in the blog post, that effort is indeed being invested in these languages.

Is Python doomed? Of course not!

So, what about Python then? The blog post picks up a claim that Python is incredibly slow when it comes to matrix multiplication. And later, without much context, it claims that PyPy is perhaps 2 times faster than Python, but sometimes slower.

One could have asked the PyPy community, or looked at their blog. PyPy doesn’t have auto-vectorization as far as I know, so, indeed, it has a hard time reaching the performance of vectorized code, but it is much faster than what’s implied. Such broad claims are not just unjustified and bad style, they are also painfully unscientific. No, Python is not necessarily slow and bloated. Maps, hidden classes, and object shapes make it possible to store even integers rather efficiently. With the previously mentioned storage strategies, this works extremely well for your huge array of integers, too.
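
For readers unfamiliar with maps/hidden classes/object shapes, here is a minimal, self-contained sketch of the idea in Python. The class and field names are invented for illustration; real VMs implement this at a much lower level, but the core trick is the same: objects with the same set of fields share one shape.

# Minimal sketch of maps/hidden classes ("object shapes"): objects with the same
# fields share one shape, which maps field names to slot indices, so the
# per-object storage is just a flat array of values.
class Shape:
    def __init__(self, fields=()):
        self.fields = dict(fields)          # name -> slot index
        self._transitions = {}              # name -> successor shape (cached)

    def with_field(self, name):
        succ = self._transitions.get(name)
        if succ is None:
            succ = Shape(list(self.fields.items()) + [(name, len(self.fields))])
            self._transitions[name] = succ  # shared by all objects taking this path
        return succ

EMPTY_SHAPE = Shape()

class Obj:
    def __init__(self):
        self.shape = EMPTY_SHAPE
        self.slots = []

    def set(self, name, value):
        index = self.shape.fields.get(name)
        if index is None:                   # new field: transition to the next shape
            self.shape = self.shape.with_field(name)
            self.slots.append(value)
        else:
            self.slots[index] = value

    def get(self, name):
        return self.slots[self.shape.fields[name]]

a, b = Obj(), Obj()
a.set('x', 1); a.set('y', 2)
b.set('x', 3); b.set('y', 4)
print(a.get('y'), b.get('x'))        # 2 3
print(a.shape is b.shape)            # True: both objects share the same shape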

Now, perhaps more important is the question of whether auto-vectorization is something dynamic languages simply can’t support. Again, there’s no fundamental reason why VMs for dynamic languages can’t support it. GraalVM supports auto-vectorization and can speed up JavaScript, Python, Ruby, and other dynamic languages with it. Key techniques are speculative compilation, type feedback, object shapes, and storage strategies. With them, optimistic compilers can vectorize even the code of dynamic languages. For good measure, I should probably also mention Juan Fumero’s work on using just-in-time compilation for dynamic languages to target GPUs.

Are Dynamic Languages Single-Threaded?

This one, I do indeed take personally.

First of all, being effectively single-threaded can be a Good Thing™. Why would you want to deal with shared-memory race conditions? This is a language design question, so in the end a matter of taste. And, if one really wants to, some forms of shared memory are possible, even without introducing all its drawbacks.

That aside, Python, R, and others have fairly nice ways of supporting parallelism, including multiprocess approaches. And, to blow my own horn a little more: if you really want shared-memory multithreading, you can have that efficiently, too, with our work on thread-safe object models [3] and thread-safe storage strategies [4].

Where to go from here?

To sum up: yes, metal languages will continue to be important, but so will the irrational exuberance languages. We as a community should lead the way in developing systems (at the hardware and software levels) that will make them run faster.

Yes, we should. And the PL community spent a lot of effort doing this for the last 30 years. So, please join us. We are out there and could use help!

MPLR has a keynote coming up titled “Hardware Support for Managed Languages: An Old Idea Whose Time Has Finally Come?”. Perhaps worth registering for. Attendance is free.

And of course, SPLASH is around the corner, too. With DLS and VMIL, SPLASH has a strong history of providing venues for discussing the implementation of dynamic languages. OOPSLA, as in the past, has pretty cool papers on dynamic language implementation.

Other relevant venues include for instance ICOOOLPS and MoreVMs.

My Introduction to Efficient and Safe Implementations of Dynamic Languages has many pointers to research, some of which also made it into an early overview of the field.

For questions or differences of opinion, find me on Twitter @smarr.

Post Scriptum

Part of my annoyance with the blog post is its dismissive and condescending tone.

Some of these so-called “scripting languages” are essentially moribund, like Perl (1987)

This seems a completely unnecessary point to make. It is also unclear how this is assessed. The Perl community is active, but yes, a lot of effort went into Raku.

Elsewhere, it states that “JavaScript is an outlier”, refers to the “monoculture [of] (the browser)”, and says there’s “no need for compatibility with legacy code”. JavaScript is certainly an interesting special case, and indeed, as acknowledged, browsers just happen to have attracted a lot of funding. But a monoculture? In which sense? At least in the past, “the browser” itself was rather diverse. And when it comes to backwards compatibility, the number one design goal of Ecma TC39, the group designing JavaScript, is “don’t break the web”.

And of course, browsers themselves are incredibly complex “legacy systems”. To get a taste, I’d recommend listening to Michael Lippautz talking about how C++ and JavaScript interact.

Post Post Scriptum

Python programmers need to distinguish between time and memory spent in pure Python (optimizable) from time and memory spent in C libraries (not so much). They need help tracking down expensive and insidious traffic across the language boundaries (copying and serialization).

We could also try to get rid of that boundary instead. Or give more power to people like Stephen Kell, and avoid copying/serialization in a different way.

Metaprogramming, Metaobject Protocols, Gradual Type Checks: Optimizing the "Unoptimizable" Using Old Ideas

Last year, I was asked to give a talk at the Meta’19 workshop. It’s a workshop on metaprogramming and reflection. The submission deadline for this year’s edition is less than a month away: check it out!

With my interest in making run-time metaprogramming fast, I thought it might be worthwhile to explore how techniques from the early days of just-in-time compiling virtual machines are still the key elements of optimizing modern language features.

Very naively, I looked at the relevant papers and when they were published, and picked some of the top-10 movies of the year. I thought, somehow, there should be a story line. I wrote the abstract (below) and sent it to the organizers. They seemed to like it. Good.

A few months later, a few weeks before the actual workshop, I found myself looking at the abstract: damn, what was I thinking?

After bending and stretching the metaphors well beyond the breaking point, I ended up with a talk that may or may not make sense. Ignoring the poor metaphors, it sketches how polymorphic inline caches (WP) and maps (also known as object shapes or hidden classes) can be combined to optimize run-time reflection, metaobject protocols, and gradual typing.

The interesting insight here is that a map/hidden class can be used to encode a wide range of properties about a group of objects, which enables compilers to optimize and generate efficient code.
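
To give a flavor of the combination, here is a small, self-contained Python sketch of a polymorphic inline cache keyed on an object’s shape. The CallSite class and its names are made up for illustration, and a plain Python class stands in for the map/hidden class a VM would use; it is a sketch of the idea, not of any VM’s actual machinery.

# Minimal sketch: a polymorphic inline cache keyed on an object's shape caches
# both the reflective method lookup and the outcome of a (gradual) type check,
# so both are paid only once per shape at a given call site.
class CallSite:
    def __init__(self, selector, expected_type):
        self.selector = selector
        self.expected_type = expected_type
        self.cache = {}                       # shape -> looked-up method

    def call(self, receiver, *args):
        shape = type(receiver)                # stand-in for the object's shape/map
        method = self.cache.get(shape)
        if method is None:                    # slow path: reflection + type check
            if not isinstance(receiver, self.expected_type):
                raise TypeError(f"{shape.__name__} fails the type check")
            method = getattr(shape, self.selector)
            self.cache[shape] = method        # subsequent calls hit the cache
        return method(receiver, *args)

class Point:
    def __init__(self, x, y): self.x, self.y = x, y
    def magnitude(self): return (self.x ** 2 + self.y ** 2) ** 0.5

site = CallSite('magnitude', Point)
print(site.call(Point(3, 4)))   # 5.0, slow path: lookup + check, then cached
print(site.call(Point(6, 8)))   # 10.0, fast path: cache hit on the Point shape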

Abstract

Metaobject Protocols and Type Checks, do they have much in common? Perhaps not from a language perspective. However, under the hood of a modern virtual machine, they turn out to show very similar behavior and can be optimized very similarly.

This talk will go back to the days of Terminator 2, The Naked Gun 2 1/2, and Star Trek VI. We will revisit the early days of just-in-time compilation, the basic insights that are still true, and see how to apply them to metaprogramming techniques of different shapes and forms.

If you have any questions, I am more than happy to answer, possibly on Twitter @smarr.

Is This Noise, or Does This Mean Something? #benchmarking

Do my performance measurements allow me to conclude anything at all?

For the work on various research languages (SOMns, SOM, TruffleSOM …), I have been using continuous performance tracking for the better part of the last decade.

One would hope that I would know what I am doing…

For about half a year now, I have been using ReBenchDB to track performance (before that, I used Codespeed), and I use the same statistical tools that I use for academic research papers. While ReBenchDB is of great help, and built for exactly what I want to do, it made me question more strongly than before whether I can even trust these benchmark results.

Screenshot of ReBenchDB showing the results of an experiment

The screenshot shows a table with the results of an experiment for which I would like to know whether it changes the performance of my language implementation. I picked a section from the full report showing microbenchmarks, since I would expect them to be the simplest to explain, and for this particular experiment, I wouldn’t expect any changes.

On first inspection, however, the results seem mixed. We got two benchmarks marked as red and three as green, which may indicate slowdowns and speedups. Though, a closer look at the boxplots indicates that we can’t really draw any strong conclusions. The boxes pretty much overlap, and my experience tells me that this may as well just be noise. I have seen similar ups and downs rather frequently on these benchmarks.

This means, I can’t tell with confidence whether the change I tried to benchmark has any kind of positive or negative impact on performance. Though, I am somewhat sure that nothing went terribly wrong. I guess that’s at least something.

One option may be to let the benchmarks run for longer. Currently, it takes about 40min for the benchmark setup to run. If the reliability of the results could be increased, perhaps running the benchmarks for an hour may still be good enough. Much longer run times will just mean that benchmarking will be ignored, so there’s really an upper limit on how long things can take. Another option might be adding more benchmarking machines, though currently there’s no budget for that either.

So, what would it take to improve the reliability of my benchmarks and get numbers that I can trust more readily?

Gathering More Data to Understand What’s Going On

The benchmarking setup I am using relies on ReBench. This means, in the ideal case, I can simply ask ReBench to execute every benchmark for 3000 iterations and repeat this 30 times. Since I know that compilation triggers after 1000 iterations, this should give us fully warmed-up execution. Having everything executed 30 times, i.e., using 30 invocations, should make sure that any variation across different invocations can also be determined reliably.

This should be all we need to change about the setup:

rebench --invocation=30 --iteration=3000 rebench.conf

About 7.5 days later, I got the data I wanted.

Not something that’s practical for everyday development…

Get a Bit of an Overview of All the Data

Having the data, let’s try to get an impression of the overall situation. A line chart that shows the iterations on the x-axis and the run time of an iteration in milliseconds on the y-axis should do the job.

Though, we got 30 invocations for every benchmark. To see how these invocations differ, I’ll include all of them but make the lines transparent. This should allow us to see differences in behavior.
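
For those who want to reproduce this kind of plot, here is a rough sketch of how it can be done. It assumes the measurements were exported to a CSV file with columns benchmark, invocation, iteration, and value (run time in ms); the file name and column names are assumptions for illustration, not ReBenchDB’s actual export format.

# Sketch: one semi-transparent line per invocation, so overlapping behavior
# shows up darker. Assumed data layout: benchmark, invocation, iteration, value.
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('measurements.csv')
bench = data[data.benchmark == 'IntegerLoop']

fig, ax = plt.subplots()
for _, inv in bench.groupby('invocation'):
    ax.plot(inv.iteration, inv.value, color='steelblue', alpha=0.2, linewidth=0.8)

ax.set_xlabel('iteration')
ax.set_ylabel('run time (ms)')
ax.set_title('IntegerLoop: 30 invocations, 3000 iterations each')
fig.savefig('integer-loop-iterations.png', dpi=150)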

The first plot below shows the IntegerLoop benchmark:

public benchmark = ( | bounds a |
  bounds:: 20000.
  bounds negated to: bounds by: 1 do: [:value | a:: value - value].
  ^ a
)

The above code is a loop from -20000 to 20000 that sets the variable a to the result of subtracting the current loop index from itself. It’s a classic microbenchmark and I wouldn’t expect too many complications from it.

Run time of iterations for IntegerLoop benchmark on SOMns

Though, the plot shows a few interesting bits. First, note that there is a dark blue line coming down and then settling at about 25ms. At iteration 1000, we can see a clear spike. At this point, the benchmark harness itself crossed the compilation threshold and got compiled. Before that, only our benchmark method from above was compiled, since it was executed in an additional loop in the harness. Unfortunately, by compiling the harness as well, performance seems to decrease a bit to about 30ms, i.e., a 20% increase in run time. Since compilers rely on heuristics, these things can happen. Though, it would be good to look into why it happens here.

But back to the plot itself. We also see spikes in different shades of blue and of different sizes. These spikes likely correspond to garbage collection or compilation events. Though, overall, the result is pretty stable. Except for the undesirable slowdown, we would hope that all benchmarks look like this, since all thirty invocations would report the same result.

The next plot below shows the Vacation benchmark, which is a simulation of a travel booking system that has been used, for instance, to benchmark software transactional memory systems. The version used here relies on threads and locks, even though we run it with only a single thread.

Run time of iterations for Vacation benchmark on SOMns

All Iteration Plots

[Box: per-iteration run-time plots for every benchmark in the macro-steady, micro-steady, and savina-jit suites on SOMns (Graal).]

The plot is clearly less nice than the one for IntegerLoop. Only after about 100 iterations do we reach a point where the measurements fit onto the plot. This shows that the compiler can optimize the benchmark very well, but it takes time to do so. Eventually, the benchmark reaches a run time of under 2ms, which isn’t ideal, since we might simply be measuring noise at this level of performance.

We also see that there is a much larger area in different shades of blue. Though, there seems to be a blue core, which indicates that there is a somewhat normal-ish distribution of iteration times across our 30 invocations of the benchmark.

Just to emphasize what is visualized, I picked out one invocation and, instead of plotting it in a transparent blue, plotted it in red. With it, we can see the ups and downs within a single invocation, but the overall trend seems to stabilize.

The plots for all benchmarks are shown in the box on the right.

How Close Can We Get to the “True” Average Performance?

After inspecting our plots, and considering the compilation threshold of 1000 iterations, it seems plausible to define the “true” average for our benchmarks based on the measurements after the 2000th iteration. In the previous plots, you may have noticed the red dashed line at the 2000th iteration. This definition gives us 1000 iterations to calculate the average from. As a somewhat stable average, I’ll use the median for now. Let’s assume for simplicity that this is fine.

Since the time benchmarking takes needs to be practical, I need to minimize this warmup threshold as well as the number of iterations used to calculate the median, while keeping it within an acceptable range of the “true” median I calculated from the last 1000 iterations.

To get a feeling for whether this is workable, I’ll visualize how far off the median is when I take various warmup thresholds and numbers of iterations to calculate the average.
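
The underlying computation is straightforward; here is a sketch, using the same assumed CSV layout as above, that computes the percentage difference from the “true” median for a grid of warmup thresholds and iteration counts. The grid step sizes are arbitrary choices for illustration.

# Sketch: for each (warmup, n) pair, take the median of the n iterations after
# the warmup cutoff and compare it against the "true" median of iterations > 2000.
import pandas as pd

data = pd.read_csv('measurements.csv')
bench = data[data.benchmark == 'IntegerLoop']

true_median = bench[bench.iteration > 2000].value.median()

rows = []
for warmup in range(0, 2001, 50):
    after_warmup = bench[bench.iteration > warmup]
    for n in range(10, 1001, 10):
        used = after_warmup[after_warmup.iteration <= warmup + n]
        diff = (used.value.median() - true_median) / true_median * 100
        rows.append((warmup, n, diff))

grid = pd.DataFrame(rows, columns=['warmup', 'iterations', 'diff_percent'])
print(grid[grid.diff_percent.abs() <= 0.1].head())   # configurations close to the "true" median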

The plot below depicts the difference from the “true” median in percent. A clear blue means that the median calculated after discarding the warmup iterations (y-axis) and using only the first n iterations after that (x-axis) is identical to the “true” median. A clear bright red means the median differs by 5%. If no color is given, the median differs by more than 5%.

Difference from "true" median in % for IntegerLoop

For our IntegerLoop benchmark, we can easily match this figure to the previous one. Because of how the compiler decided to optimize, we initially have a performance that differs by about 20% from the “true” median and thus is left white. Around the 1000th iteration, discarding the iterations before it as warmup and using only perhaps 100 iterations for the median, we see some yellow, which means the median is off by 2.5%. Below 1000 iterations of warmup, but with more than 100 or so iterations, we see a darker shade of blue, indicating perhaps a 1% difference from the “true” median. The rest is clear blue, indicating that whichever configuration of warmup and iterations we choose would match the “true” median.

Difference from "true" median in % for Vacation

All Warmup/Iteration Plots

[Box: warmup/iteration difference plots for every benchmark in the macro-steady, micro-steady, and savina-jit suites on SOMns (Graal).]

For the Vacation benchmark, the plot is more interesting. For the most part, we are far off the “true” median. Though, with an increasing number of warmup iterations and iterations used for the median, we get closer to it. This establishes a diagonal where the values become gradually more accurate.

Looking at these plots for all benchmarks in the box on the right, we see that for quite a number of them, things look pretty good, and they show a lot of blue. Though, there are also a number where we clearly see that the compilation threshold is an issue, and we only get close to the “true” median after we cross the threshold. For other benchmarks, there are large parts that do not come close to the “true” median, as for the Vacation benchmark.

Is There A More Stable and Informative Metric?

While the above plots give us some insight into how much warmup and how many iterations are needed to get close to the “true” median, the precision leaves a bit to be desired. To keep these plots readable, I also avoided adding other colors to distinguish between values below and above the “true” median.

Originally, I wanted to use bootstrapping and bootstrap confidence intervals here, which would avoid making unsupported assumptions about the distribution of measurements. Unfortunately, it turned out to be impractical with the large data set at hand. Instead, I’ll look at a few quantiles, the arithmetic mean, and the minimum. The minimum has been shown before to be a metric that stabilizes fast, though I assume that it hides possibly important changes in practice.
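
As a concrete illustration of what is being compared, here is a small sketch, again on the assumed CSV layout from above, that computes these metrics for one arbitrarily chosen warmup/iteration pair and reports how far each is from its “true” counterpart. The chosen pair and the exact quantiles are assumptions for illustration.

# Sketch: compute a few quantiles, the mean, and the minimum of the selected
# measurements, and compare each against its "true" value from iterations > 2000.
import pandas as pd

def metrics(values):
    return {
        'min':    values.min(),
        'q25':    values.quantile(0.25),
        'median': values.median(),
        'mean':   values.mean(),
        'q75':    values.quantile(0.75),
    }

data = pd.read_csv('measurements.csv')
bench = data[data.benchmark == 'Vacation']

true_vals = metrics(bench[bench.iteration > 2000].value)

warmup, n = 500, 300                      # one assumed (warmup, iterations) pair
used = bench[(bench.iteration > warmup) & (bench.iteration <= warmup + n)]
observed = metrics(used.value)

for name in true_vals:
    diff = (observed[name] - true_vals[name]) / true_vals[name] * 100
    print(f'{name:>6}: {diff:+.2f}% from "true" value')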

To investigate these metrics, I will look at the pairs of warmup and number of iterations that fall roughly on the diagonal of the plots in the last section, indicated by the crosses you may have noticed.

As before, the first plot shows the results for IntegerLoop. The plot focuses on the range of 5% above and 5% below the “true” value assumed for each metric.

Difference from "true" value in % for IntegerLoop using data from all 30 invocations

Because of the difference between the later performance and the first 1000 iterations, we see the metrics converging to the “true” value only after the 1000th iteration. Interestingly, the median starts converging first, while the minimum comes in later. This happens here because we first need to discard enough warmup iterations before the minimum is what we expect it to be based on the last 1000 iterations.

All Difference Plots

[Box: metric difference plots for every benchmark in the macro-steady, micro-steady, and savina-jit suites on SOMns (Graal).]
Difference from "true" value in % for Vacation using data from all 30 invocations

For the Vacation benchmark, the plot again looks very different. As we would expect from the previous plots, the metrics converge very late. Though, here the minimum and the lower quantiles converge before the median and mean.

The other plots are again in a box on the right.

What’s the Largest Differnce from the “True” Value for these Metrics?

The above plots give us an idea of what it looks like when we use the data of all 30 invocations. Though, realistically, this takes too long for day-to-day engineering. Below, I am going to look at how the metrics behave when we always pick the invocation with the largest difference from the “true” value.
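
Computing this “worst invocation” view is a small variation on the sketches above, assuming the same CSV layout: the metric is computed per invocation for one assumed warmup/iteration pair, and the largest deviation from the “true” value is kept.

# Sketch: since a single benchmark run could be any one of the invocations,
# report the largest per-invocation deviation from the "true" median.
import pandas as pd

data = pd.read_csv('measurements.csv')
bench = data[data.benchmark == 'IntegerLoop']
true_median = bench[bench.iteration > 2000].value.median()

warmup, n = 500, 300                      # one assumed (warmup, iterations) pair
used = bench[(bench.iteration > warmup) & (bench.iteration <= warmup + n)]

per_invocation = used.groupby('invocation').value.median()
diffs = (per_invocation - true_median) / true_median * 100
print('max. difference of any invocation: %.2f%%' % diffs.abs().max())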

For the IntegerLoop benchmark, things look fairly familiar. Though, the mean does seem to get worse with more data. The median seems to be the best metric of the lot, but stays well above the indicated 0.1% error margin.

Max. difference of any invocation from "true" median in % for IntegerLoop

All Max. Difference Plots

[Box: max. difference plots for every benchmark in the macro-steady, micro-steady, and savina-jit suites on SOMns (Graal).]

For the Vacation benchmark, the situation is quite a bit worse. We are getting nowhere near the accuracy that we would need. The error reaches almost 25%.

Difference from "true" value in % for Vacation using data from all 30 invocations

Looking at the plots for all benchmarks in the box on the right doesn’t make me feel more joyous either. It’s bad.

How Many Benchmarks Reach a Certain Accuracy?

With these plots, the next question that comes to my mind is how many benchmarks actually reach a desired accuracy. So, for how many benchmarks is the difference from the “true” average within an allowed range?

The plot below indicates again: it doesn’t look good. The 25th percentile seems to behave reasonably well, and I’ll rely on it for the further discussion.

In total, there are 46 benchmarks. If we are optimistic and allow merely an error of 0.1%, there are only 5 benchmarks that reach the goal using the 25th percentile as the metric. With a 1% error, we get up to 12 benchmarks, and with 5%, even 27.

Though, this means there are a lot of benchmarks for which the accuracy is not very practical. Even with a 5% fluctuation, I can’t really tell whether a small optimization in the language implementation has an impact on performance.

The number of benchmarks that reach a given accuracy goal, i.e., for a metric, the benchmark's value is within the allowed difference from the "true" value.

How Much Data Do We Need to Reach a Certain Accuracy?

To conclude the discussion for now, let’s look at how much data we need to reach a desired level of accuracy. Using the 25th percentile as before, I’ll determine the number of iterations it takes (including warmup) before we reach a given degree of accuracy.

The table below shows the required iterations for a given accuracy. Benchmarks that aren’t listed simply don’t reach the accuracy. For the ones that are listed, we can see that a lot of data is needed. Of course, this still assumes we only use a single invocation for the benchmark. Thus, these numbers are derived from the maximum error between the “true” value and any one of the observed invocations.

Benchmark 0.1% 0.2% 0.3% 0.4% 0.5% 0.6% 0.7% 0.8% 0.9% 1% 1.1% 1.2% 1.3% 1.4% 1.5% 1.6% 1.7% 1.8% 1.9% 2% 2.1% 2.2% 2.3% 2.4% 2.5%
Dispatch 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352
FieldLoop 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352
Mandelbrot 1352 1352 577 577 577 577 577 552 552 552 552 552 552 502 502 352 352 352 352 352 352 352 352 352 352
ProducerConsumerBoundedBuffer 1052 502 477 277 277 277 277 277 227 227 177 77 77 77 77 77 77 77 77 77 77 52 52 52 52
Sum 777 777 777 777 777 777 777 777 302 302 302 302 302 302 302 227 227 227 227 227 202 202 202 202 202
NBody     427 427 427 427 427 427 427 427 427 427 277 277 277 277 277 277 277 277 277 277 252 252 252
TrapezoidalApproximation     127 127 127 127 127 52 52 52 52 52 52 52 52 52 52 52 52 52 52 52 52 52 52
IntegerLoop           1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352 1352
ClosureFibonacci             2652 2602 2402 977 452 452 452 427 427 427 427 427 427 427 427 427 427 427 427
Lee                 1702 1652 1602 1602 1527 1527 1527 1502 1502 1427 1427 1402 1402 1402 1402 1402 1377
ConcurrentSortedLinkedList                   2952 2852 2802 2777 2627 2602 1702 1452 1402 1402 1377 1377 502 452 452 402
NQueens                   2702 2702 2702 2702 2702 2702 2702 2702 2702 2702 2702 2702 2702 2702 2702 2702
CD                         2727 2702 677 677 677 652 652 652 652 652 652 652 652
Recurse                           2502 127 127 127 127 127 127 127 127 102 102 102
Splay                                     3002 2877 2727 2352 2352 2327 2327
RichardsNS                                       877 502 502 502 502 502
WhileLoop                                       1452 1277 1277 852 852 852
Havlak                                             2952 2952 2202

Further Reading

This analysis isn’t done in a vacuum of course. Proper benchmarking and evaluation are an important topic in literature, especially when it comes to reporting results in academic papers. Georges at al. (2007) have long advocated for using statistically rigorous methods for evaluating just-in-time compiled systems. Their proposed methodology includes to take the mean per invocation for the iterations after a certain warmup cut off. Though, they use also confidence intervals and the coefficient of variance, which I haven’t done here. Interestingly they report a maximum difference of 20% between the minimum and maximum for benchmarks at peak performance. The benchmarks I am using on SOMns seem to show higher differences. Georges at al. also provided JavaStat, a tool to automatically determine the necessary number of benchmark iterations to reach a desired confidence interval width.

Kalibera and Jones (2013) expand on this work by investigating how to do rigorous benchmarking in reasonable time. They advocate for the use of effect-size confidence intervals as well as a visual approach to determine that measurements are independent of each other. They also look at the variation between invocations, which for at least one benchmark was 30%. This led them to propose a way to determine how many iterations and invocations one needs to achieve a desired precision. In an earlier paper, they went into the underlying technical details.

More recently, Barrett et al. looked into one of the underlying assumptions we typically make: we hope our systems reach a steady state. Spoiler: they often don’t. They use Krun to control the system on which the benchmarks are executed to minimize interference coming from the operating system and hardware. The work concludes with useful suggestions about sizing experiments, similar to what Kalibera and Jones did. They suggest using at the very least 10 invocations, where 15 should give reliable results.

Bulej et al. (2020) investigate how to benchmark in noisy environments such as multitenant cloud servers. They propose an approach that carefully synchronizes the execution of the pairs of systems to be compared. This seems to reduce the size of their bootstrap confidence intervals significantly.

What’s Next?

The aforementioned literature lays a foundation for statistically rigorous work. I have been adhering to at least some of the suggestions to ensure reliable results in my own papers. However, none of these approaches help me with my day-to-day engineering needs. Even doing 5 or 10 invocations for each benchmark results in overall run times of the benchmarking job that make it too slow for good usability.

From the data, it is evident that the compilation threshold is clearly observable. The first step will be to reduce it and hope that this leads to better results. A very quick and dirty run with the benchmarking settings I have been using didn’t indicate any significant improvements, but a full run is still in progress.

Another knob to turn is the benchmark parameters. Benchmarks such as Vacation have very short iteration times, which makes them susceptible to noise.

And then there are the strange things, for instance with IntegerLoop, which need to be investigated and possibly reveal engineering issues.

Since SOMns uses Truffle and Graal, using libgraal may further improve things. It would hopefully avoid interference from the compiler being compiled at run time. Similarly, perhaps there are a few simple things to be stolen from Krun that could reduce the environmental noise.

And there are probably a hundred other things I could try.

If you have suggestions, tips and tricks, comments, questions, or complaints, you can find me on Twitter @smarr.
