I am working at the School of Computing
at the University of Kent.
My work focuses on programming language implementation techniques
and concurrent and parallel programming.

Currently, I am working on combining concurrency models in a safe manner
and to enable developers to make sense of complex concurrent programs
with appropriate tools.

Last year, I was asked to give a talk for the Meta’19 workshop.
It’s a workshop on metaprogramming and reflection.
The submission deadline for this year’s edition, is less than a month away: Check it out!

With my interest in making run-time metaprogramming fast,
I thought it might be worthwhile to explore how techniques from the early days of just-in-time compiling virtual machines are still the key elements of optimizing modern language features.

Very naively, I looked at the relevant papers, and when they were published, and picked some of the top-10 movies of the year. I thought, somehow, there should be a story line.
I wrote the abstract (below) and sent it to the organizers.
They seemed to like it. Good.

A few month later, a few weeks before the actual workshop,
I found myself looking at the abstract: damn, what was I thinking?

After bending and stretching the metaphors well beyond the breaking point, I ended up with a talk, that may or may not make sense.
Ignoring the poor metaphors,
it sketches how polymorphic inline caches (WP) and
maps
(also know as object shapes or hidden classes)
can be combined to optimize run-time reflection, metaobject protocols, and gradual typing.

The interesting insight here is that a map/hidden class can be used
to encode a wide range of properties about a group of objects,
which enables compilers to optimize and generate efficient code.

Abstract

Metaobject Protocols and Type Checks, do they have much in common? Perhaps not from a language perspective. However, under the hood of a modern virtual machine, they turn out to show very similar behavior and can be optimized very similarly.

This talk will go back to the days of Terminator 2, The Naked Gun 2 1/2, and Star Trek VI. We will revisit the early days of just-in-time compilation, the basic insights that are still true, and see how to apply them to metaprogramming techniques of different shapes and forms.

If you have any questions, I am more than happy to answer, possibly on Twitter @smarr.

Do my performance measurements allow me to conclude anything at all?

For the work on various research languages (SOMns, SOM, TruffleSOM …), I have
been using continuous performance tracking for the better part of the last decade.

One would hope that I would know what I am doing…

Since about half a year, I am using ReBenchDB to track performance (before I used Codespeed),
and use the same statistical tools, that I am using for academic research
papers. While ReBenchDB is of great help, and built for exactly what I want to do,
it made me question more strongly than before whether I can even trust these
benchmark results.

The screenshot shows a table with the results of an experiment
for which I would like to know whether it changes the performance
of my language implementation.
I picked a section from the full report showing microbenchmarks, since I would expect
them to be the simplest to explain, and for this particular experiment,
I wouldn’t expect any changes.

On first inspection, the results seem however mixed.
We got two benchmarks marked as red and three as green,
which may indicate slowdowns and speedups.
Though, a closer look at the boxplots indicates that we can’t really draw
any strong conclusions. The boxes pretty much overlap, and my experience tells
me that this may as well just be noise. I have seen similar ups and downs
rather frequently on these benchmarks.

This means, I can’t tell with confidence whether the change
I tried to benchmark has any kind of positive or negative impact on
performance. Though, I am somewhat sure that nothing went terribly wrong.
I guess that’s at least something.

One option may perhaps be to let the benchmarks run for longer.
Currently, it takes about 40min for the benchmark setup to run.
If the reliability of the results could be increased,
perhaps running the benchmarks for an hour may still be good enough.
Much longer run times will just mean that benchmarking will be ignored,
so, there’s really an upper limit on how long things can take.
Another option might be adding more benchmarking machines, though,
currently, there’s no budget for that either.

So, what would it take to improve the reliability of my benchmarks
and get numbers that I can trust more readily?

Gathering More Data to Understand What’s Going On

The benchmarking setup I am using relies on ReBench. This means,
in the ideal case, I can simply ask ReBench to execute every benchmark for 3000
iterations and repeat this 30. Since I know that compilation triggers after
1000 iterations, this should give us fully warmed up execution.
Having everything executed 30 times, i.e., using 30 invocations should make sure that
any variation across different invocations can also be determined reliably.

This should be all we need to change about the setup:

Not something that’s practical for everyday development…

Get a Bit of an Overview of All the Data

Having the data, let’s try to get an impression of the overall situation.
A line chart that shows the iterations on the x-axis and the run time of
an iteration in milliseconds on the y-axis should do the job.

Though, we got 30 invocations for every benchmark. To see how these invocations
differ, I’ll include all but make the lines transparent. This should allow
us to see differences in behavior.

The first plot below shows the IntegerLoop
benchmark:

public benchmark = ( | bounds a |
bounds:: 20000.
bounds negated to: bounds by: 1 do: [:value | a:: value - value].
^ a
)

The above code is a loop from -20000 to 20000 that sets the variable a
to the result of subtracting the current loop index from itself.
It’s a classic microbenchmark
and I wouldn’t expect too many complications from it.

Though, the plot shows a few interesting bits. First note that there is a
dark blue line coming down and then settling at about 25ms.
At iteration 1000, we can see a clear spike. At this point, the
benchmark harness itself crossed the compilation threshold and got compiled.
Before, only our benchmark method from above was compiled,
since it was executed in an additional loop in the harness.
Unfortunately, by compiling the harness as well, performance seems to
decrease a bit to about 30ms, i.e. a 20% increase in run time.
Since compilers rely on heuristics,
these things can happen.
Though, it would be good to look into why it happens here.

But back to the plot itself. We also see light blue spikes,
in different shades of blue, and different sizes. These spikes likely
correspond to garbage collection or compilation events.
Though, overall, the result is pretty stable. Except for the undesirable slowdown,
we would hope that all benchmarks look like this, since all thirty invocations would
report the same result.

The next plot below shows the
Vacation benchmark,
which is a simulation of a travel booking system that has been used for instance to
benchmark software transactional memory systems.
The version used here, relies on threads and locks, even though, we run it only
with a single thread.

The plot is clearly less nice than the one for IntegerLoop.
Only after about a 100 iterations, we reach a point where the measurements fit onto the plot.
This shows that the compiler can optimize the benchmark very well, but it takes time to do so.
Eventually, the benchmark reaches a run time of under 2ms, which isn’t ideal,
since we might simply be measuring noise at this level of performance.

We also see that there is a much larger area that is different shades of blue.
Though, there seems to be a blue core, which indicates that there is a
somewhat normal-ish distribution of iteration times across our 30 invocations
of the benchmark.

Just to emphasize what is visualized, I picked out one invocation,
and instead of plotting it in a transparent blue, it’s plotted in red.
With it we can see the ups and downs within a single
invocation, but the overall trend seems to stabilize.

The plots for all benchmarks are shown in the box on the right.

How Close Can We Get to the “True” Average Performance?

After inspecting our plots, and considering the compilation threshold of 1000
iterations, it seems plausible to define the “true” average for our benchmarks
based on the measurements after the 2000th iteration.
In the previous plots, you may have noticed the red dashed line at the 2000th
iteration. This definition gives us 1000 iterations to calculate the average
from. As a somewhat stable average, I’ll use the median for now.
Let’s assume for simplicity that this is fine.

Since the time benchmarking takes needs to be practical,
I need to to minimize this warmup threshold as well as the number of iterations
used to calculate the median, while keeping it within an acceptable range of
the “true” median I calculated from the last 1000 iterations.

To get a feeling of whether this is workable, I’ll visualize how far off
the median is when I take various different warmup thresholds and
number of iterations to calculate the average.

The plot below depicts the difference to the “true” median in percent.
A clear blue means that the median calculated from after discarding the
warmup iteration on the y-axis and using only the first n iterations (x-axis)
is identical to the “true” median. A clear bright red means the median differs
by 5%. If no color is given, the median differs by more than 5%.

This means, for our IntegerLoop benchmark, we can match the figure easily to
our previous figure. Because of how the compilation decided to optimize,
we have initially a performance that differs by about 20% from the “true” median
and thus, is left white.
Around the 1000th iteration, discarding them as warmup,
and using only perhaps 100 iterations for the mean, we see some yellow, which
means the mean is off by 2.5%. Below 1000 iterations of warmup, but with more
than a 100 or so iterations, we see a darker shade of blue, indicating perhaps
1% difference from the mean. The rest is clear blue, indicating that which ever
configuration of warmup and iterations we chose would match the true median.

For the Vacation benchmark, the plot is more interesting.
For the most parts, we are far off the “true” median. Though, then with
increasing number of warmup and used iterations, we get close to it.
This establishes a diagonal where the values become gradually more accurate.

Looking at these plots for all benchmarks in the box on the right,
we see that for quite a number of them,
things look pretty good, and they have a lot of blue. Though, there are also
a number, where we clearly see that the compilation threshold is an issue,
and we only get closer to the “true” median, after we crossed the threshold.
For other benchmarks, there are a lot of parts that do not come close to the
“true” median, like for the Vacation benchmark.

Is There A More Stable and Informative Metric?

While the above plots give us some insight in how much warmup and number of
iterations are needed to gets close to the “true” median,
the precision leaves a bit to be desired.
To make these plots readable, I also avoided to add other colors to distinguish
between values below and above the “true” median.

Originally, I wanted to use bootstrapping
and bootstrap confidence intervals here,
which would avoid making unsupported assumptions about the distribution of
measurements.
Unfortunately, it turned out to be impractical with the large data set at hand.
Instead, I’ll look at a few quantiles, the arithmetic mean, and minimum.
The minimum has been shown before as a metric that stabilizes fast, though,
I assume that it does hide possibly important changes in practice.

To investigate these metrics,
I will look at the pairs of warmup and number of iterations that fall roughly
on the diagonal of the plots in the last section, indicated by the crosses
you may have noticed.

As before, the first plot shows the results for IntegerLoop.
The plot focuses on the range of 5% above and 5% below the “true” value assumed
for each metric.

Because of the difference of the later performance to the first 1000 iterations,
we see the metrics converging to the “true” value only after the 1000th iteration.
Interestingly, the median start converging first, while the minimum comes in later.
This happens here because we first need to reach the point where we discard
enough warmup iterations, before the minimum is what we expect it to be
based on the last 1000 iterations.

For the Vacation benchmark, the plot again looks very different.
As we would expect from previous plots, the metrics converge very late.
Though, here the minimum and the lower quantiles converge before
the median and mean.

The other plots are again in a box on the right.

What’s the Largest Differnce from the “True” Value for these Metrics?

The above plots give use an idea of what it looks like
when we use the data of 30 invocations.
Though, realistically, this takes too long for day-to-day engineering.
Below, I am going to look at how the metrics behave when we always pick
the invocation with the largest difference from
the “true” value.

For the IntegerLoop benchmark, things look fairly familiar.
Though, the mean does seem to get worse with more data.
The median seems to be the best metric of the lot,
but stays well above the indicated 0.1% error margin.

For the Vacation benchmark, the situation is quite a bit worse.
We are getting nowhere near the accuracy that we would need.
The error does barely reach 25%.

Looking at the plots for all benchmarks in the box on the right, doesn’t make me feel more joyous either.
It’s bad.

How Many Benchmarks Reach a Certain Accuracy?

With these plots, the next question coming to my mind is how many benchmarks
actually reach a desired accuracy. So, for how many benchmarks is the difference
to a “true” average within an allowed range.

The plot below indicates again: it doesn’t look good.
The 25th percentile seems to be have reasonably well
and I’ll rely on it for the further discussion.

In total, there are 46 benchmarks.
If we are optimistic, and allow merely an error of 0.1%,
there are only 5
benchmarks that reach the goal using the 25th percentile as metric.
With a 1% error, we get up to 12 benchmarks, and with 5% even 27.

Though this means there are a lot of benchmarks, for which the accuracy
is not very practical. Even with a 5% fluctuation, I can’t really tell whether
a small optimization in the language implementation has an impact on
performance.

How Much Data Do We Need to Reach a Certain Accuracy?

To conclude the discussion for now, let’s look at how much data we need to reach
a desired level of accuracy.
Using the 25% percentile as before, I’ll determine the number of iterations
that it takes (including warmup) before we reach a given degree of accuracy.

The table below shows the required iterations for a given accuracy.
Benchmarks that aren’t listed, simply don’t reach the accuracy.
For the ones that are listed, we can see that a lot of data is needed.
Of course, this is still assuming we only use a single invocation for the benchmark.
Thus, these numbers a derived from the maximum error between “true” value and
one of the observed invocations.

Benchmark

0.1%

0.2%

0.3%

0.4%

0.5%

0.6%

0.7%

0.8%

0.9%

1%

1.1%

1.2%

1.3%

1.4%

1.5%

1.6%

1.7%

1.8%

1.9%

2%

2.1%

2.2%

2.3%

2.4%

2.5%

Dispatch

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

FieldLoop

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

Mandelbrot

1352

1352

577

577

577

577

577

552

552

552

552

552

552

502

502

352

352

352

352

352

352

352

352

352

352

ProducerConsumerBoundedBuffer

1052

502

477

277

277

277

277

277

227

227

177

77

77

77

77

77

77

77

77

77

77

52

52

52

52

Sum

777

777

777

777

777

777

777

777

302

302

302

302

302

302

302

227

227

227

227

227

202

202

202

202

202

NBody

427

427

427

427

427

427

427

427

427

427

277

277

277

277

277

277

277

277

277

277

252

252

252

TrapezoidalApproximation

127

127

127

127

127

52

52

52

52

52

52

52

52

52

52

52

52

52

52

52

52

52

52

IntegerLoop

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

1352

ClosureFibonacci

2652

2602

2402

977

452

452

452

427

427

427

427

427

427

427

427

427

427

427

427

Lee

1702

1652

1602

1602

1527

1527

1527

1502

1502

1427

1427

1402

1402

1402

1402

1402

1377

ConcurrentSortedLinkedList

2952

2852

2802

2777

2627

2602

1702

1452

1402

1402

1377

1377

502

452

452

402

NQueens

2702

2702

2702

2702

2702

2702

2702

2702

2702

2702

2702

2702

2702

2702

2702

2702

CD

2727

2702

677

677

677

652

652

652

652

652

652

652

652

Recurse

2502

127

127

127

127

127

127

127

127

102

102

102

Splay

3002

2877

2727

2352

2352

2327

2327

RichardsNS

877

502

502

502

502

502

WhileLoop

1452

1277

1277

852

852

852

Havlak

2952

2952

2202

Further Reading

This analysis isn’t done in a vacuum of course.
Proper benchmarking and evaluation are an important topic in literature,
especially when it comes to reporting results in academic papers.
Georges at al. (2007)
have long advocated for using statistically rigorous methods for evaluating
just-in-time compiled systems. Their proposed methodology includes to take the
mean per invocation for the iterations after a certain warmup cut off.
Though, they use also confidence intervals and the coefficient of variance,
which I haven’t done here.
Interestingly they report a maximum difference of 20% between the
minimum and maximum for benchmarks at peak performance.
The benchmarks I am using on SOMns seem to show higher differences.
Georges at al. also provided JavaStat, a tool to automatically determine the necessary
number of benchmark iterations to reach a desired confidence interval
width.

Kalibera and Jones (2013)
expand on the work by investigating how to do rigorous benchmarks in reasonable time.
They advocate for the use of effect size confidence intervals as well as a visual approach
to determine that measurements are independent of each other.
They also look at the variation between invocations, which at least for one
benchmark was 30%. This lead them to propose a way to determine
how many iterations and invocations one would need to achieve a desired precision.
In an earlier paper
they went into the underlying technical details.

More recently, Barrett et al.
looked into one of the underlying assumptions we typically have: we hope our
systems reach a steady state. Spoiler: they often don’t.
They use Krun to control the
system on which the benchmarks are executed to minimize interference coming
from the operating system and hardware.
The work concludes with useful suggestions about sizing experiments, similar
to what Kalibera and Jones did.
They suggest to use at the very least 10 invocations, where 15 should give reliable results.

Bulej et al. (2020) investigate how to benchmark
in noisy environments such as multitenant cloud servers.
They propose an approach that carefully synchronizes the execution of the pairs
of systems to be compared. This seems to reduce the size of their bootstrap
confidence intervals significantly.

What’s Next?

The aforementioned literature lays a foundation for statistically rigorous work.
I have been adhering to at least some of the suggestions to ensure reliable
results in my own papers. However, none of these approaches help me with my
day-to-day engineering needs. Even doing 5 or 10 invocations for each benchmark
result in overall run times of the benchmarking job that make them too slow for
good usability.

From the data it is evident that the compilation threshold is clearly observable.
The first step will be to reduce it and hope that this will lead to better results.
A very quick and dirty run with the benchmarking settings I have been using
didn’t indicate any significant improvements, but a full run is still in progress.

Another knob to be turned are the benchmark parameters. Benchmarks such as
Vacation have very short iteration times, which makes them susceptible to noise.

And, then, there are the strange things, for instance with IntegerLoop,
which need to be investigated, and possibly reveal engineering issues.

Since SOMns uses Truffle and Graal, using libgraal may further improve things. It would hopefully avoid interference from
the compiler being compiled at run time. Similarly, perhaps there are a few
simple things to be stolen from Krun that could reduce the environmental noise.

And there are probably, a hundred other things I could try.

If you got suggestions, tips and tricks, comments, questions, or complains,
you can find me over on Twitter @smarr.

Last September, I had a lot of fun putting together a lecture on
language implementation techniques. It is something I wanted to do for a while,
but I had not had a good excuse before to actually do it.

When I got asked to give this lecture at a Dagstuhl summer school,
I posted an outline on Twitter, and as one might expect, some people raised
concerns about things that are missing. And, indeed, it’s far from complete,
and biased by my own experience, background, and research.
Though, perhaps it is still useful for others.

Abstract

Dynamic languages leave all hard problems to the runtime system. Some argue
this allows programmers to focus on the application they implement, but it also
means that dynamic language implementations have to learn at run time what a
program does so that they can execute it efficiently.

This lecture will give a brief introduction into implementation techniques,
starting from abstract-syntax-tree and bytecode interpreters, and then going to
modern just-in-time compilation approaches based on partial evaluation or
meta-tracing. We will review ideas such as inline caching, hidden classes,
and storage strategies to understand better how dynamic languages can reach the
performance of less dynamic languages such as Java or C.

Optimizations such as storage strategies unfortunately have a major impact on
thread safety for dynamic languages such as Ruby and Python, which use shared
memory multi-threading. To ensure that we can implement such languages with
safe and efficient parallelism, we will also review variations for classic
storage strategies and object models for parallel virtual machines.

Last but not least, since the techniques are all about performance, we have to
discuss how to effectively assess optimizations. Modern virtual machines and
hardware systems are far from the deterministic machines we expect, which means
we have to take extra care when measuring and reporting performance numbers.

The final agenda for the lecture included:

interpretation

abstract syntax trees

bytecodes

optimizations at the interpreter level

caching, and lookup caching in particular

bytecode quickening

self-optimizing AST interpreters

just-in-time compilation

basic ideas of compilation at run time

metacompilation techniques based on meta-tracing and partial evaluation

efficient data representation

maps, hidden classes, shapes

storage strategies

data representations for multithreaded VMs

concurrent shapes

concurrent strategies

benchmarking

pitfalls

recommendations

This means there are indeed very many things, I haven’t been talking about.
This includes fundamental ideas such as garbage collection, compiler optimizations
and their intermediate representations, tiered compilation,
tooling for debugging & profiling, and metacircularity.
Perhaps, I get to teach a full course at some point, and can include them.

Until then, the following slides might be useful to others.

If you have any questions, I am more than happy to answer, possibly on Twitter @smarr.