How do we do Benchmarking?
Impressions from Conversations with the Community
This summer, I talked to a number of groups from the community about how they do benchmarking, for instance as part of their day-to-day engineering and for the evaluation of ideas for research papers. I talked with them about their general approach and the tools they use.
This showed a wide variety of approaches, opinions, preferences, and tools. However, it also showed me that there remains a lot of work to be done to get best practices adopted, and perhaps even to build better tools, research better ways of benchmarking, and adopt reliable means for data processing and data analysis.
At the Virtual Machine Meetup (VMM’21), I reported a bit on my general impressions and recounted the issues people talked about, their solutions or desire for solutions, the good practices observed, as well as ideas for improvements. There is a lot of diversity in the approaches used in the community. Some of the diversity comes from the wide range of research questions people are interested in, but a significant amount seems to be caused by the effort and expertise required for benchmarking, which is arguably very high.
Methodology
Before looking at the slides, please note that what we have here is based on a small set of interviews, using a semi-structured approach. This means the discussions were open-ended, and I did not ask exactly the same questions in each of them. Furthermore, much of the data is based on inference from these discussions.
Best Practices
This being said, there are a few points that seemed to be good practices, perhaps worth advocating for.
- 
    Use Automated Testing/Continuous Integration: Correctness comes before performance. While not every project may justify a huge investment into testing infrastructure, I know that everything that I do not test is broken. Thus, at the very least, we need to make sure that our benchmarks compute the expected results. 
- 
    Use Same Setup for Day-to-Day Engineering as for Benchmarks used in Papers: Too often, benchmarks are the last thing in the process, and the week before the deadline there’s a sense of panic. Benchmarking is not impossibly hard, but also far from trivial. A good and reliable setup takes time. So, start from day 1, use it while building your system/experiment, and then have it ready and tested when you want results for the paper. 
- 
    Most Continuous Integration Systems will Manage Artifacts: Keeping track of benchmark results is hard. If you’re using CI (see point 1), you could probably use it to store benchmark results as well. This means you automatically keep track of at least a part of the relevant bits of information needed to figure out what the data means when trying to analyse it. 
- 
    Automate Data Handling: Copying data around manually makes it too easy to make mistakes. For instance, when using spreadsheets, try to use the spreadsheet’s system for data import instead of copying and pasting values, which eliminates one source of easy mistakes. Of course, having things automated likely also means that rerunning and analyzing results after a bug fix or last-minute change becomes much easier. 
- 
    Define Workflow that Works for Your Group: Too often the knowledge of how to do benchmarking and performance evaluation can be in the head of a PhD student, who may leave after finishing. Instead of letting the knowledge leave with them, it’s worthwhile to actively start teaching how to run good benchmarks. It also makes the life of new team members much easier… 
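To make two of these points a bit more concrete, here is a minimal Python sketch of a harness that checks a benchmark's result on every iteration and keeps the data handling automated from measurement to summary. It is not taken from any of the setups discussed in the interviews. The `fib` workload, the function names, and the CSV columns are all made up for illustration, and a real harness would also need to deal with warmup, noise, and system configuration.

```python
import csv
import io
import statistics
import time


def fib(n):
    """Toy workload, a hypothetical stand-in for a real benchmark."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def run_benchmark(name, fn, arg, expected, iterations=5):
    """Run fn(arg) repeatedly and verify the result on every iteration."""
    rows = []
    for i in range(iterations):
        start = time.perf_counter()
        result = fn(arg)
        elapsed = time.perf_counter() - start
        # Correctness comes before performance: never record a wrong result.
        if result != expected:
            raise AssertionError(f"{name}: expected {expected}, got {result}")
        rows.append({"benchmark": name, "iteration": i, "seconds": elapsed})
    return rows


def write_results(rows, out):
    """Write measurements as CSV so that no value is ever copied by hand."""
    writer = csv.DictWriter(out, fieldnames=["benchmark", "iteration", "seconds"])
    writer.writeheader()
    writer.writerows(rows)


def summarize(inp):
    """Read the CSV back and compute the median run time per benchmark."""
    by_name = {}
    for row in csv.DictReader(inp):
        by_name.setdefault(row["benchmark"], []).append(float(row["seconds"]))
    return {name: statistics.median(vals) for name, vals in by_name.items()}


if __name__ == "__main__":
    # An in-memory buffer keeps the sketch self-contained. In practice this
    # would be a file that the CI system stores as an artifact.
    buffer = io.StringIO()
    write_results(run_benchmark("fib", fib, 30, 832040), buffer)
    buffer.seek(0)
    print(summarize(buffer))
```

Because the summary is computed from the stored CSV rather than from values copied around by hand, rerunning everything after a bug fix is a single command, and the same script can be used for day-to-day engineering and for the numbers that end up in a paper.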
These are just some quick thoughts after giving my talk. If you have questions, feel free to reach out via Twitter.