Currently, I am working on fast dynamic language implementations,
on combining concurrency models in a safe manner,
automatically preventing concurrency issues from manifesting,
and on enabling developers to make sense of complex concurrent programs
with appropriate tools.
An Implementer’s Perspective on High-level Concurrency Models, Debugging Tools, and the Future of Automatic Bug Mitigation
The actor model is a great tool for various use cases. Though, it’s not the
only tool, and sometimes perhaps not even the best. Consequently, developers
started mixing and matching high-level concurrency models based on the problem
at hand, much like other programming abstractions. Though, this comes with
various problems. For instance, we don’t usually have debugging tools that help
us to make sense of the resulting system. If we even have a debugger, it may
barely allow us to step through our programs instruction by instruction.
Let’s imagine a better world! One where we can follow asynchronous messages,
jump to the next transaction commit, or break on the next fork/join task
created. Though, race conditions remain notoriously difficult to reproduce. One
solution is to record our program’s execution, ideally capturing the bug. Then
we can replay it as often as needed to identify the cause of our bug.
The hard bit here is making record & replay practical.
I will explain how our concurrency-model-agnostic approach allows us
to record model interactions trivially for later replay,
and how we minimized its run-time overhead.
In the case of actor applications, we can even make the snapshotting fast
to be able to limit trace sizes.
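To make the basic idea of record & replay concrete, here is a deliberately tiny Python sketch. It is not the approach from the talk: the names Recorder, Replayer, and run are made up for illustration, and a seeded random choice stands in for the real sources of nondeterminism. The only thing recorded is the order in which messages are delivered, which is enough to reproduce the same execution later.

```python
import random
from collections import deque


class Recorder:
    """Hypothetical policy: delivers messages in a random order and records it."""

    def __init__(self):
        self.trace = []

    def pick(self, pending, rng):
        # The random choice stands in for scheduler or network nondeterminism.
        index = rng.randrange(len(pending))
        message = pending.pop(index)
        self.trace.append(message[0])  # store only the message id
        return message


class Replayer:
    """Hypothetical policy: delivers messages in the order of a recorded trace."""

    def __init__(self, trace):
        self.remaining = deque(trace)

    def pick(self, pending, rng):
        # The random source is ignored; the trace decides what comes next.
        wanted = self.remaining.popleft()
        for index, message in enumerate(pending):
            if message[0] == wanted:
                return pending.pop(index)
        raise RuntimeError(f"message {wanted} is not pending; trace does not match")


def run(policy, seed):
    """A toy 'actor' that appends every received payload to its state."""
    rng = random.Random(seed)
    pending = [(msg_id, payload) for msg_id, payload in enumerate("abcde")]
    state = []
    while pending:
        _, payload = policy.pick(pending, rng)
        state.append(payload)
    return "".join(state)


if __name__ == "__main__":
    recorder = Recorder()
    original = run(recorder, seed=1)
    # Replaying with a different seed still yields the recorded execution.
    replayed = run(Replayer(recorder.trace), seed=999)
    print(original == replayed)
```

The trace here contains one small integer per message. Keeping the recorded data this small is what keeps overhead and trace size manageable, though a real system has many more sources of nondeterminism to capture.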
Having better debugging capabilities is a real productivity boost.
Though, some bugs will always slip through the cracks.
So, what if we could prevent those bugs from causing issues?
Other researchers have shown how to do it, and I’ll conclude this talk
with some ideas on how we can utilize the knowledge we have in our
language implementations to make such mitigation approaches fast.
The talk is based on work done in collaboration with
Dominik Aumayr, Carmen Torres Lopez, Elisa Gonzalez Boix, and Hanspeter Mössenböck.
I’d also like to thank the AGERE!’21 organizers for inviting me.
I enjoyed preparing the talk in some sense as a retrospective of the work
we did in the MetaConc Project.
For a more complete list of things and papers published in the wider context
of the project, please head over to its website.
If you have questions, please feel free to reach out via Twitter.
This summer, I talked to a number of groups from the community on
how they do benchmarking for instance as part of their day-to-day engineering
and for the evaluation of ideas for research papers.
I talked with them about their general approach and the tools they use.
This showed a wide variety of approaches, opinions, preferences, and tools.
Though, it also showed me that there remains a lot of work to be done
to get best practices adopted, and perhaps even build better tools,
research better ways of benchmarking, and adopt reliable means for
data processing and data analysis.
At the Virtual Machine Meetup (VMM’21),
I reported a bit on my general impressions, recounted the issues people
talked about, their solutions or desire for solutions, the good practices
observed, as well as ideas for improvements. There is a lot of diversity
in the approaches used in the community. Some of the
diversity comes from the wide range of research questions people are interested
in, but a significant amount seems to be caused by the effort and expertise
required for benchmarking, which is arguably very high.
Before looking at the slides, please note that what we have here is based on
a small set of interviews, using a semi-structured approach.
This means, the discussions were open-ended, and I did not ask exactly the
same questions. Furthermore, much of the data is based on inference from these
discussions.
This being said, there are a couple of points that seemed to be good practices
perhaps worth advocating for.
Use Automated Testing/Continuous Integration
Correctness comes before performance. While not every project may justify
a huge investment into testing infrastructure, I know that everything that I
do not test is broken. Thus, at the very least, we need to make sure that our
benchmarks compute the expected results.
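As a minimal sketch of what "benchmarks compute the expected results" can look like in practice, the following Python harness refuses to report a timing for a run that produced a wrong answer. The names fib and run_verified are made up for this example, and the workload is only a stand-in for a real benchmark.

```python
import time


def fib(n):
    """Stand-in benchmark workload: the n-th Fibonacci number, iteratively."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def run_verified(benchmark, arg, expected, iterations=5):
    """Time benchmark(arg) and fail loudly if any run returns a wrong result."""
    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        result = benchmark(arg)
        elapsed = time.perf_counter() - start
        if result != expected:
            # A fast but wrong benchmark is worthless, so no timing is recorded.
            raise AssertionError(f"expected {expected!r}, got {result!r}")
        timings.append(elapsed)
    return timings


if __name__ == "__main__":
    print(run_verified(fib, 30, expected=832040))
```

The check sits outside the timed region, so it does not distort the measurement.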
Use Same Setup for Day-to-Day Engineering as for Benchmarks used in Papers
Too often benchmarks are the last thing in the process, and the week before
the deadline there’s a sense of panic. Benchmarking is not impossibly hard,
but also far from trivial. A good and reliable setup takes time, so start from day 1,
use it while building your system/experiment,
and then have it ready and tested when you want results for the paper.
Most Continuous Integration Systems will Manage Artifacts
Keeping track of benchmark results is hard. If you’re using CI (see point 1),
you could probably use it to store benchmark results as well.
This means, you automatically keep track of at least a part of the relevant bits
of information needed to figure out what the data means when trying to analyse it.
Automate Data Handling
Copying data around manually makes it too easy to make mistakes.
For instance when using spreadsheets, don’t copy data around manually.
Instead, try to use the spreadsheet’s system for data import,
eliminating one source of easy mistakes.
Of course, having things automated likely means that rerunning and analyzing
results after a bug fix or last minute change becomes much easier.
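For illustration, a script along the following lines can replace manual copying. It is only a sketch: the column names benchmark and time_ms and the function name summarize are assumptions, and the sample data is made up. It reads raw measurements from a CSV file and computes the median per benchmark.

```python
import csv
import io
import statistics
from collections import defaultdict


def summarize(csv_file):
    """Return {benchmark name: median run time in ms} from raw measurements."""
    samples = defaultdict(list)
    for row in csv.DictReader(csv_file):
        samples[row["benchmark"]].append(float(row["time_ms"]))
    return {name: statistics.median(values) for name, values in samples.items()}


if __name__ == "__main__":
    # Made-up sample data; in practice this would be the file the harness wrote.
    raw = io.StringIO(
        "benchmark,time_ms\n"
        "Richards,120.5\n"
        "Richards,118.0\n"
        "Richards,131.2\n"
        "DeltaBlue,80.1\n"
        "DeltaBlue,79.4\n"
    )
    for name, median in summarize(raw).items():
        print(name, median)
```

Because the whole path from raw data to summary is a script, rerunning the analysis after new measurements is a single command.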
Define Workflow that Works for Your Group
Too often the knowledge of how to do benchmarking and performance evaluation
is only in the head of a PhD student, who may leave after finishing.
Instead of letting the knowledge leave with them, it’s worthwhile to actively
start teaching how to run good benchmarks.
It also makes the life of new team members much easier…
These are just some quick thoughts after giving my talk.
If you have questions, feel free to reach out via Twitter.
Here at Kent, we have a large group of researchers working on Programming Languages and Systems (PLAS), and within this group, we have a small team
focusing on research on interpreters, compilation, and tooling to make programming easier.
It’s summer 2021, and I felt it’s time for a small inventory of the things we are up to. At this very moment, the team consists of Sophie, Octave, and myself.
I’ll include Dominik and Carmen, as well, for bragging rights. Though, they are
either finishing their PhD or have just recently defended it.
However, techniques such as inlining and lookup caches have limitations.
While one could see inlining as a way to give extra top-down context
to a compilation,
it’s inherently limited because of the overhead of run-time compilation
and excessive code generation.
To bring top-down context more explicitly into these systems, Sophie explores
the notion of execution phases. Programs often do different things one after another, perhaps first loading data, then processing it, and finally generating output. Our goal is to utilize these phases, to help compilers produce better code for each of them.
To give just one example, here is a bit from one of Sophie’s ICOOOLPS slides:
The green line there is the microbenchmark with phase-based splitting enabled,
giving a nice speedup for the second and fourth phase,
benefitting from monomorphization and only compiling what’s important
for the phase, and not for the whole program.
Sophie has already shown that there is potential, and the discussions at ICOOOLPS led to a number of new ideas for experiments, but it’s still early days.
Generating Benchmarks to Avoid the Tiny-Benchmark Trap
In some earlier blog posts
and a talk at MoreVMs’21 (recording), I argued that we need better benchmarks for our language implementations.
It’s a well-known issue that the benchmarks that academia uses for research
are rarely a good representation of what real systems do. Often, simply because they are tiny.
Thus, I want to be able to monitor a system in production and then generate benchmarks that can be freely
shared with other researchers from the behavior that we saw.
Octave is currently working on such a system to generate benchmarks from
abstract structural and behavioral data about an application.
It’s a long way, and Octave currently instruments Java applications, records what they do at run time,
and generates benchmarks with similar behavior from that.
Given that Java isn’t the smallest language, there’s a lot to be done,
but I hope we’ll have a first idea of whether this could work by the end of the
Reproducing Nondeterminism in Multiparadigm Concurrent Programs
Dominik is currently writing up his dissertation
on reproducing nondeterminism.
His work is essentially all around tracing and record & replay of concurrent systems.
We wrote a number of papers [1, 2, 3]
which developed efficient ways of doing this first just for actor programs,
and then for various other high-level concurrency models.
The end result allows us to record & replay programs that combine various
concurrency models, with very low overhead.
This is the kind of technology that is needed to reliably debug concurrency
issues, and perhaps in the future even allow for automatic mitigation!
Advanced Debugging Techniques to Handle Concurrency Bugs in Actor-based Applications
Preventing Concurrency Issues, Automatically, At Run Time
While all these projects are very dear to my heart, there’s one I’d really
love to make more progress on as well: automatically preventing concurrency issues
from causing harm.
We are looking for someone to join our team!
If you are interested in programming language implementation
and concurrency, please reach out!
We have a two-year postdoc position here at Kent in the PLAS group,
and you would join Sophie, Octave, and me to work on interesting research.
In the project, we’ll continue to collaborate also with Prof. Gonzalez Boix and her DisCo research group in Brussels (Belgium), Prof. Mössenböck in Linz (Austria),
and the GraalVM team of Oracle Labs,
which includes the opportunity for research visits.
Our team is well connected, for instance also with Shopify,
which supports a project on improving warmup and interpreter performance of