Disclaimer: The artifact, for which I put this automation together, was rejected. I take this as a reminder that the technical bits still require good documentation to be useful.

In the programming language community, as well as in other research communities, we strive to follow scientific principles. One of them is that others should be able to verify the results that we report. One way of enabling verification of our results is by making all, or at least most, elements of our systems available. Such an artifact can then be used, for instance, to rerun benchmarks, experiment with the system, or even build on top of it to answer entirely different research questions.

Unfortunately, it can be time-consuming to put such artifacts together. And the way we do artifact evaluation does not necessarily help with it: you submit your research paper, and then at some later point get to know whether it is accepted or not. Only if it is accepted do we start preparing the artifacts. Because it can be a lot of work and the deadlines are tight, the result may be less than perfect, which is rather unfortunate.

So, how can we reduce the time it takes to create artifacts?

1. More Automation

For a long time now, I have worked on gradually automating the various elements in my benchmarking and paper writing setup. It started early on with ReBench (docs), a tool to define benchmarking experiments. The goal was to enable others and myself to reexecute experiments with the same parameters and build setup. However, in the context of an artifact, this is only one element.

Perhaps more importantly, with an artifact we want to ensure that others do not run into any kind of issues during the setup of the experiments, avoiding problems with unavailable software dependencies, version conflicts, and the usual mess of our software ecosystems.

One way of avoiding these issues is to set up the whole experiment in a system virtual machine. This means all software dependencies are included, and someone using the artifact will need only the software that can execute the virtual machine image.

VirtualBox is one popular open source solution for this kind of system virtual machine. Unfortunately, setting up a virtual machine for an artifact is time-consuming.

Let’s see how we can automate it.

2. Making Artifacts Part of Continuous Integration

2.1 Packer: Creating a VirtualBox with a Script

Initially, I started using Vagrant, which allows us to script the “provisioning” of virtual machine images. This means we can use it to install the necessary software for our benchmarking setup in a VirtualBox image. Vagrant also supports systems such as Docker and VMware, but I’ll stick to VirtualBox for now.

Unfortunately, my attempt at using Vagrant was less than successful. While I was able to generate an image with all the software needed for my benchmarks, the image would not boot correctly when I tested it. It might have been me, or some fluke with the Vagrant VirtualBox image repository.

Inspired by a similar post on creating artifacts, I looked into packer.io, which allows us to create a full VirtualBox image from scratch. Thus, we have full control over what ends up in an artifact, and can script the process in a way that can be run as part of CI. With a fully automated setup, I can create an artifact on our local GitLab CI Runner, either as part of the normal CI process or perhaps only weekly, because it takes about 2h to build a VM image.

As a small optimization, I split the creation of the image into two steps. The first step creates a base image with a minimal Lubuntu installation, which can be used as a common base for different artifacts. The second step creates the concrete artifact by executing shell scripts inside the VM, which install dependencies and build all experiments so that the VM image is ready for development or benchmarking.
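The two steps can be sketched as a small shell function for a CI job. This is only a sketch under assumptions: the file names artifact-base-1604.json, base-image.ova, and artifact.json are the ones used in this post, but the function name build_artifact and the idea of skipping the base build when the base image already exists are added for illustration. It also assumes that the base template writes its result to the given base image path.

```shell
# Sketch: build the base image only when it is missing, then the artifact.
# PACKER can be overridden, e.g. with PACKER=echo for a dry run.
PACKER="${PACKER:-packer}"

build_artifact() {
  local base_template="$1"      # e.g. artifact-base-1604.json
  local base_image="$2"         # e.g. base-image.ova (assumed output of step 1)
  local artifact_template="$3"  # e.g. artifact.json

  # Step 1: the slow OS installation, skipped if a base image already exists.
  if [ ! -f "$base_image" ]; then
    "$PACKER" build "$base_template" || return 1
  fi

  # Step 2: boot the base image and run the provisioning scripts in it.
  "$PACKER" build "$artifact_template"
}

# Usage: build_artifact artifact-base-1604.json base-image.ova artifact.json
```

Keeping the base image around means that the regular CI run only pays for the second step.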

2.2 Fully Automated Benchmarking as Stepping Stone

Before going into setting up the VM image, we need some basics.

My general setup relies on two elements: a GitLab CI runner, and an automated benchmarking setup.

The fully automated benchmarking setup is useful in its own right. We have used it successfully for many of our research projects. It executes a set of benchmarks for every change pushed to our repository.

Since I am using ReBench for this, running benchmarks on the CI system is nothing more than executing the already configured set of benchmarks:

rebench benchmark.conf ci-benchmarks

For convenience, the results are reported to a Codespeed instance, where one can see the impact of any changes on performance.

Since ReBench also builds the experiments before running them, we are already halfway to a useful artifact.

2.3 Putting the Artifact Together

Since we could take any existing VirtualBox image as a starting point, let’s start with preparing the artifact, before looking at how I create my base image.

In my CI setup, creating the VirtualBox image boils down to:

packer build artifact.json
mv artifact/ ~/artifacts  # to make the artifact accessible

The key here is of course the artifact.json file, which describes where the base image is, and what to do with it to turn it into the artifact.

The following is an abbreviated version of what I am using to create an artifact for SOMns:

"builders": [ {
  "type": "virtualbox-ovf",
  "format": "ova",
  "source_path": "base-image.ova"
} ],
"provisioners": [ {
  "execute_command":
    "echo 'artifact' | {{.Vars}} sudo -S -E bash -eux '{{.Path}}'",
  "scripts": [ ... ],
  "type": "shell"
} ]

In the actual file, there is a bit more going on, but the key idea is that we take an existing VirtualBox image, boot it, and run a number of shell scripts in it.

These shell scripts do the main work. For a typical paper of mine, they would roughly:

  1. configure package repositories, e.g., Node.js and R
  2. install packages, e.g., Python, Java, LaTeX
  3. checkout the latest version of the experiment repo
  4. run ReBench to build everything and execute a benchmark to see that it works. I do this with rebench --setup-only benchmark.conf
  5. copy the evaluation parts of the paper repository into the VM
  6. build the evaluation plots of the paper with KnitR, R, and LaTeX
  7. link the useful directories, README files, and others on the desktop
  8. and for the final touch, set a project specific background file.

A partial script looks perhaps something like the following:

wget -O- https://deb.nodesource.com/setup_8.x | bash -

apt-get update
apt-get install -y openjdk-8-jdk openjdk-8-source \
                   python-pip ant nodejs

pip install git+https://github.com/smarr/ReBench

git clone "${GIT_REPO}" "${REPO_NAME}"
cd "${REPO_NAME}"  # the following commands operate on the fresh clone

git checkout "${COMMIT_SHA}"
git submodule update --init --recursive
rebench --setup-only "${REBENCH_CONF}" SOMns

It configures the Node.js apt repositories and then installs the dependencies. Afterwards, it clones the project, checks out the desired commit, and lets ReBench build everything and run a quick benchmark to see that it works. There are a few more things to be done, as can be seen for instance with the SOMns artifact.
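Step 7 from the list, linking useful directories and README files on the desktop, can be sketched as follows. The function name link_to_desktop and the example paths are hypothetical and not taken from the SOMns artifact scripts.

```shell
# Sketch: create symlinks on the desktop for the given files and directories.
link_to_desktop() {
  local desktop="$1"   # e.g. "$HOME/Desktop"
  shift
  mkdir -p "$desktop" || return 1
  local target
  for target in "$@"; do
    # -s: symbolic link, -f: replace an existing link,
    # -n: do not follow an existing link that points to a directory
    ln -sfn "$target" "$desktop/$(basename "$target")" || return 1
  done
}

# Usage (hypothetical paths):
#   link_to_desktop "$HOME/Desktop" "$HOME/SOMns" "$HOME/README.md"
```

The targets should be absolute paths, because a relative symlink would be resolved relative to the desktop directory.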

This gives us a basic artifact that can be rebuilt whenever needed. It can of course also be adapted to fit new projects easily.

The overall image is usually between 4 and 6 GB in size, and the build process, including the minimal benchmark run, takes about 2h. Afterwards, we have a tested artifact.

What remains is writing a good introduction and overview, so that others may use it, verify the results, and may even be able to regenerate the plots in the paper with their own data.

3. Creating a Base Image

As mentioned before, we can use any VirtualBox image as a base image. We might already have one from previous artifacts, and now simply want to increase automation, or we use one of the images offered by the community. We can also build one specifically for our purpose.

For artifacts, size matters. Huge VM images make downloads slow and storage difficult, and they require users to have sufficient free disk space. Therefore, we may want to ensure that the image only contains what we need.
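To keep an eye on the size in CI, one could let the build fail once the image exceeds a size budget. The following is a minimal sketch. The function name check_image_size, the file name in the usage comment, and the 6 GB budget are assumptions for illustration.

```shell
# Sketch: succeed only if a file is at most max_bytes large.
check_image_size() {
  local image="$1" max_bytes="$2" size
  size="$(wc -c < "$image")" || return 1
  size="$((size))"   # normalize, some wc implementations pad with spaces
  if [ "$size" -gt "$max_bytes" ]; then
    echo "$image: $size bytes exceeds the budget of $max_bytes bytes" >&2
    return 1
  fi
  echo "$image: $size bytes"
}

# Usage (hypothetical file name):
#   check_image_size artifact.ova $((6 * 1024 * 1024 * 1024))
```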

With packer, we can automate the image creation including the initial installation of the operating system, which gives us the freedom we need. The packer community provides various examples that are a useful foundation for custom images. Inspired by bento and an Idris artifact, I put together scripts for my own base images. These scripts download an Ubuntu server installation disk, create a VirtualBox VM, and then start the installation. An example configuration is artifact-base-1604.json, which creates a Lubuntu 16.04 core VM. The configuration sets various details including memory size, number of cores, hostname, username, password, etc. Perhaps worth highlighting are the following two settings:

"hard_drive_nonrotational": true,
"hard_drive_discard": true,

This instructs VirtualBox to present the hard drive to the guest as an SSD with support for discard (TRIM). This hopefully ensures that the disk image only uses the space that is actually required, and therefore minimizes the size of our artifacts. I am not entirely sure this is without drawbacks, though. But so far, it seems that disk zeroing and other tricks used to reduce the size of VM images are not necessary with these settings.
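For a VM that already exists, roughly the same effect should be achievable directly with VBoxManage, whose storageattach command has --nonrotational and --discard options. The sketch below only prints the command so that it can be reviewed first. The function name, VM name, controller name, and disk file are hypothetical, and as far as I know discard is only supported for disks in the VDI format.

```shell
# Sketch: print a VBoxManage call that attaches a disk as a non-rotational
# (SSD-like) drive with discard (TRIM) support. Nothing is executed.
ssd_attach_command() {
  local vm="$1" controller="$2" disk="$3"
  printf 'VBoxManage storageattach "%s" --storagectl "%s" --port 0 --device 0 --type hdd --medium "%s" --nonrotational on --discard on\n' \
    "$vm" "$controller" "$disk"
}

# Usage (hypothetical names):
#   ssd_attach_command artifact-vm "SATA Controller" disk.vdi
```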

In addition to the artifact-base-1604.json file, the preseed.cfg instructs the Ubuntu installer to configure the system and install useful packages such as an SSH server, a minimal Lubuntu core system, Firefox, a PDF viewer, and a few other things. After these are successfully installed, the basic-setup.sh script disables automatic updates, configures the SSH server for use in a VM, enables password-less sudo, and installs the VirtualBox guest support.
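Two of the basic-setup.sh tasks, disabling automatic updates and enabling password-less sudo, might look roughly like the following. This is a sketch and not the content of the actual script: the function names, the user name artifact, and the optional root-directory parameter (there only so that the functions can be tried out on a scratch directory) are assumptions.

```shell
# Sketch: write the config files below a root directory (default: /).
disable_auto_updates() {
  local root="${1:-}"
  mkdir -p "$root/etc/apt/apt.conf.d" || return 1
  # APT's periodic jobs are switched off by setting both options to "0".
  cat > "$root/etc/apt/apt.conf.d/20auto-upgrades" <<'EOF'
APT::Periodic::Update-Package-Lists "0";
APT::Periodic::Unattended-Upgrade "0";
EOF
}

enable_passwordless_sudo() {
  local user="$1" root="${2:-}"
  mkdir -p "$root/etc/sudoers.d" || return 1
  echo "$user ALL=(ALL) NOPASSWD:ALL" > "$root/etc/sudoers.d/$user" || return 1
  # sudoers drop-in files are expected to be read-only (mode 0440).
  chmod 0440 "$root/etc/sudoers.d/$user"
}

# Usage inside the VM, as root (user name is an assumption):
#   disable_auto_updates
#   enable_passwordless_sudo artifact
```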

The result is packaged up as a *.ova file which can be directly loaded by VirtualBox and becomes the base image for my artifacts.

4. What’s Next?

With this setup, we automate the recurring and time-consuming tasks of creating VirtualBox images that contain our artifacts. In my case, such artifacts contain the sources, benchmarking infrastructure, as well as the scripts to recreate all numbers and plots in the paper.

That means the artifact still lacks documentation of how the pieces can be used, how SOMns is implemented, and what one would need to do to change things. Thus, for the next artifact, I hope that having this automation will allow me to focus on writing better documentation instead of putting together all the bits and pieces manually. For new projects, I can hopefully reuse this infrastructure, and get the artifact created by the CI server from day 1. Whether it actually works, I’ll hopefully see soon.

Docker would also be worth looking into as a more lightweight alternative to VirtualBox. Last year I asked academic Twitter, and containers seemed, by a small margin, to be the desired solution. Ideally, most of the scripts can be reused and simply be executed in a suitable Docker container. Though, I still haven’t tried it.
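As an untested sketch of that idea, a Dockerfile that runs an existing provisioning script during the image build could be generated like this. The function name, the script name provision.sh, and the ubuntu:16.04 base image are assumptions. Since commands in a Docker build run as root by default, the sudo wrapper from the packer configuration would not be needed.

```shell
# Sketch: print a Dockerfile that runs a provisioning script at build time.
write_dockerfile() {
  local script="$1" image="${2:-ubuntu:16.04}"
  cat <<EOF
FROM $image
COPY $script /tmp/$script
RUN bash -eux /tmp/$script
EOF
}

# Usage (hypothetical names):
#   write_dockerfile provision.sh > Dockerfile
#   docker build -t artifact .
```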


I’d like to thank Richard and Guido for comments on a draft.