Tracing vs. Partial Evaluation: Artifact Description

Submitted Paper Draft

This document gives an overview of the experimental setup used for our paper. We provide brief setup instructions to facilitate re-execution, and detail how the benchmark results were processed for the paper.

This material is also available in a public GitHub repository.

1. Getting Started Guide

The artifact is provided as a VirtualBox image. For separate source and data downloads, please see section 3.

1.1 Download

VirtualBox Image: Mirror 1, Mirror 2

MD5 checksum: d7ac6b99ba4f02efe2ac8f7685176a80

1.2 Setup Instructions

The VirtualBox image was created with version 4.3.28 available from virtualbox.org.

The image contains a Lubuntu 15.04 installation:

1.3 Basic Experiment Execution and Data Analysis

Execute benchmarks (takes about 70h):

`rebench rebench.conf`

Collect implementation size statistics:

`scripts/patch-statistics.sh`

Generate report:

`scripts/knit.R evaluation.Rmd`

The VirtualBox image starts up with a shell in the ~/experiments folder. The benchmarks are executed by the ReBench tool.

To run them, execute rebench rebench.conf. Executing all benchmarks takes about 70 hours. It creates the file data/benchmark.data, which, however, is not directly used by the analysis scripts.

Note that ReBench gives a warning when the process's niceness cannot be set. To reduce measurement errors, it is advisable to run it with sudo so that the OS scheduler does not interfere unnecessarily with the benchmarks.
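For instance, when superuser rights are available, the full suite can be started as:

`sudo rebench rebench.conf`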

The scripts/patch-statistics.sh script collects the implementation and change sizes used in the evaluation and stores the results as .csv files in data/.
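A quick way to verify that the statistics were generated is to list the resulting files:

`ls data/*.csv`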

From either the supplied data or the newly generated data, a report can be generated from the evaluation.Rmd file. If the benchmark data was obtained by running ReBench, note that the data_file variable in evaluation.Rmd needs to be adjusted by removing the .bz2 file extension. The report is stored in evaluation.html.
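The adjustment can also be made with a one-line substitution; this is a minimal sketch, assuming GNU sed and the default data_file value shown in the 'Using Another Data File' subsection below:

sed -i 's|benchmark.data.bz2|benchmark.data|' evaluation.Rmd   # point data_file at the freshly generated data
scripts/knit.R evaluation.Rmd                                  # regenerate evaluation.html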

2. The Artifacts and Claims

The artifacts provided with our paper are intended to enable others to verify that:

Furthermore, we would like to recommend SOM as a dynamic language implementation of very manageable size that reaches high performance and therefore enables a wide range of research experiments.

Further material on SOM:

3. Step-by-Step Instructions for the Experiments

This section gives a more detailed overview of the setup and execution of the experiments.

3.1 Download

In addition to the VirtualBox image (cf. 1.1), we also provide the elements of the artifact as separate downloads.

3.2 Software Dependencies

The general software requirements are as follows:

On an Ubuntu-based system, the following packages should provide the required software:

sudo apt-get install build-essential ant pypy libffi-dev pkg-config git \
     make python-pip python-scipy r-base

sudo pip install ReBench

A compatible JDK can be downloaded from the Java SE 8 Archive.
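Depending on how the JDK is installed, it may need to be made available on the PATH; the following is a minimal sketch, assuming the archive was unpacked to /opt/jdk1.8.0 (the path is only an example):

export JAVA_HOME=/opt/jdk1.8.0       # adjust to the actual unpack location
export PATH="$JAVA_HOME/bin:$PATH"
java -version                        # should report a 1.8.0 version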

3.3 Checkout Git Repository and Build Experiments

Our experiments are managed with git. We use submodules to track the versions of the various branches and ensure that all experiments are built based on the correct source code.

The setup for this paper is in the papers/metatracing-vs-partialevaluation branch of the repository https://github.com/smarr/selfopt-interp-performance.

To checkout the repository:

git clone --recursive -b papers/metatracing-vs-partialevaluation https://github.com/smarr/selfopt-interp-performance mt-vs-pe

Note that the cloning can take a while since the repository contains about 20 experiments and larger submodules such as the Graal codebase. A full git clone has a size of about 770MB currently.
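After cloning, it can be verified that all submodules were checked out recursively, for example:

cd mt-vs-pe
git submodule status --recursive | head   # lists the pinned revisions of the submodules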

To build all code artifacts, switch to the implementations folder and execute the setup.sh script or the separate build-*.sh files.

cd mt-vs-pe         # folder with the repository
cd implementations  # sources of the experiments, Graal, benchmarks, etc.
./setup.sh

Building all experiments can take multiple hours. In particular, the compilation of the RPython-based experiments takes about 15 minutes each. In case compilation fails at some point, the build-*.sh files can be started manually. The build-rtrufflesom.sh (for SOM_MT) and build-trufflesom.sh (for SOM_PE) scripts loop over the experiments and execute the makefiles. If something goes wrong, it is helpful to comment out the make clean step to avoid recompiling everything.
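For example, to re-run only one of the build scripts manually after a failure (the script names are the ones mentioned above):

cd implementations
./build-trufflesom.sh      # rebuilds the SOM_PE experiments
# ./build-rtrufflesom.sh   # likewise for the SOM_MT experiments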

WARNING: the R-minimal-without-jit-annotations experiment won't compile, because it does not contain any jit-driver for RPython to indicate where meta-tracing can start. The scripts are currently not robust enough to take this into account automatically. Sorry for the inconvenience.

3.4 Execution of Experiments

To execute the benchmarks, we use the ReBench benchmarking tool. The experiments and all benchmark parameters are configured in the rebench.conf file. The file has three main sections: benchmark_suites, virtual_machines, and experiments, which describe the settings for all experiments. Note that the names used in the configuration file are post-processed for the paper by the R scripts that generate the graphs. Thus, the configuration contains all information necessary to find the benchmark implementations in the repositories, but the names do not match exactly those used in the paper.

Two important parameters for ReBench are the -d switch, which shows debug output, and the -N switch, which disables the use of the nice command to increase the process priority of the benchmarks. The -N switch is only necessary when root or sudo are not available.

To run the benchmarks:

cd mt-vs-pe  # folder with the repository

sudo rebench -d rebench.conf  # runs with additional debug output
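If superuser rights are not available, the benchmarks can still be run without adjusting the process priority:

rebench -d -N rebench.conf  # debug output, without using nice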

All benchmark results are recorded in the data/benchmark.data file. The benchmarks can be interrupted at any point, and ReBench will later continue the execution where it left off. However, the results of partial runs of a single virtual machine invocation are not recorded, to avoid mixing up results from before and after the warmup phases.

When executed in debug mode (-d), the output can be verbose. Most warnings can be ignored as long as ReBench is able to obtain the benchmark results. If, however, the execution of an experiment fails, the output contains, for instance, the command line used to run the experiment, which helps in debugging the issue.

Executing a Subset of Benchmarks

When not all experiments need to be executed, for instance to verify the performance of only a subset of them, one can comment out the experiment names given in the variable_values: section of the corresponding benchmark suites. Similarly, the benchmark set itself can be reduced to a subset by commenting out the corresponding lines in the benchmarks: section of a benchmark suite.

For example, to only compare the baseline against the version without array strategies, the settings for steady-trufflesom should be edited to resemble the following snippet.

benchmark_suites:
    steady-trufflesom:
        gauge_adapter: RebenchLog
        command: &SOM_CMD " -cp Smalltalk:Examples/Benchmarks/Json:Examples/Benchmarks/LanguageFeatures:Examples/Benchmarks/GraphSearch:Examples/Benchmarks/Richards:Examples/Benchmarks/DeltaBlue:Examples/Benchmarks/NBody Examples/Benchmarks/BenchmarkHarness.som  %(benchmark)s "
        max_runtime: 60000
        variable_values: &TSOM_EXP
            - baseline
            #- minimal
            #- without-args-in-frame
            - without-array-strategies
            #- without-blocks-without-context
            #- without-catch-nonlocal-return-node
            # ... more commented out below

If, for instance, only the DeltaBlue benchmark is desired, the benchmarks: settings below the section given above can be changed to:

            #- without-unessential-lowering-prims
            #- without-var-access-specialization
        benchmarks: &FULL_WARMUP
            - DeltaBlue:
                extra_args: "1000 0 6000"
            #- Fannkuch:
            #    extra_args: "500 0 9"
            #- Mandelbrot:
            #    extra_args: "500 0 1500"
            #- Richards:
            #    extra_args: "500 0 100"

To obtain the statistics about implementation size, the scripts/patch-statistics.sh script is used. On the one hand, it relies on git diff --shortstat to determine the differences between the baseline branch and an experiment; on the other hand, it uses the cloc tool, a copy of which is included in the repository. The script iterates over the experiments and writes the results into *.csv files in the data/ folder.
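Conceptually, the script does something along the lines of the following for each experiment; the branch name is only illustrative, and the exact invocations are in scripts/patch-statistics.sh:

git diff --shortstat baseline..without-array-strategies   # size of an experiment's change relative to the baseline
cloc --quiet .                                            # implementation size (cloc is bundled with the repository)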

3.5 Report Generation and Comparison

Before the results can be processed, a few R libraries have to be installed. For this step R might require superuser rights. See scripts/libraries.R for details.

sudo Rscript scripts/libraries.R

After the libraries have been installed, the actual report can be generated by executing scripts/knit.R evaluation.Rmd. This uses the knitr tool to generate an HTML file from the markdown file with embedded R code.

The result should look like this example. The report does not directly discuss the results; please see the paper draft for that. Instead, the report describes how the results are evaluated, to enable future studies based on our results and evaluation.

Using Another Data File

Note that evaluation.Rmd uses the data/benchmark.data.bz2 file. To change that, please see the top of the file, around line 21. The data_file variable needs to be adapted, for instance, to point to the output of ReBench, which is normally data/benchmark.data (without .bz2).

The adapted section in evaluation.Rmd should look like this:

22:  source("scripts/config.R", chdir=TRUE)
23:  data_file = "../data/benchmark.data"  ## <<-- was: data_file = "../data/benchmark.data.bz2"
24:  source("scripts/init.R", chdir=TRUE)

Licensing

The material in this repository is licensed under the terms of the MIT License. Please note that the repository links, in the form of submodules, to other repositories that are licensed under different terms.