Rigorous benchmarking in reasonable time

Experimental evaluation is key to systems research. Because modern systems are complex and non-deterministic, good experimental methodology demands that researchers account for uncertainty. To obtain valid results, they are expected to run many iterations of benchmarks, invoke virtual machines (VMs) several times, or even rebuild VM or benchmark binaries more than once. All this repetition adds to the time experiments take to complete. Currently, many evaluations give up on sufficient repetition or rigorous statistical methods, or even run benchmarks only with training-size inputs. Reported results often lack proper estimates of variation, and when only a small difference between two systems is found, some of those results are simply unreliable. In contrast, we provide a statistically rigorous methodology for repetition and for summarising results that makes efficient use of experimentation time. Time efficiency comes from two key observations. First, a given benchmark on a given platform is typically prone to much less non-determinism than the common worst case suggested by published corner-case studies. Second, repetition is most needed where most uncertainty arises, whether between builds, between executions, or between iterations. We capture experimentation cost with a novel mathematical model, which we use to identify, for each level of an experiment, the number of repetitions that is necessary and sufficient to obtain a given level of precision. We present our methodology as a cookbook that guides researchers on how many repetitions to run to obtain reliable results. We also show how to present results with an effect-size confidence interval. As an example, we show how to use our methodology to conduct throughput experiments with the DaCapo and SPEC CPU benchmarks on three recent platforms.

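The two ideas the abstract sketches, repeating most where the variance actually arises and balancing repetition counts against per-level cost, can be made concrete with a small calculation. Below is a minimal Python sketch, assuming the standard optimal-allocation rule for nested designs (repetitions at a level proportional to the square root of the cost ratio times the variance ratio between adjacent levels); the function name, the three-level iteration/execution/build setup, and the numbers are illustrative assumptions, not the paper's actual cookbook or API.

```python
import math

def optimal_repetition_counts(variances, costs):
    """Suggest how often to repeat at each level of a nested experiment.

    variances[i] -- estimated variance contributed at level i
                    (e.g. 0 = iterations, 1 = executions, 2 = builds);
                    all values must be positive
    costs[i]     -- rough cost, in seconds, of one repetition at level i

    Returns a count for every level except the topmost one. The number of
    repetitions at the top level is then grown until the resulting
    confidence interval is as tight as the experimenter needs.
    """
    counts = []
    for i in range(len(variances) - 1):
        # Classic optimal-allocation ratio for multi-stage sampling:
        # repeat more at a level when it is cheap relative to the level
        # above and when it contributes more of the variance.
        n = math.sqrt((costs[i + 1] / costs[i]) *
                      (variances[i] / variances[i + 1]))
        counts.append(max(1, math.ceil(n)))
    return counts

if __name__ == "__main__":
    # Hypothetical numbers: iteration-level noise dominates, execution- and
    # build-level noise are similar, iterations are cheap, rebuilds are slow.
    variances = [1.0e-3, 1.0e-4, 1.0e-4]   # iteration, execution, build
    costs = [1.0, 60.0, 600.0]             # seconds per repetition
    print(optimal_repetition_counts(variances, costs))
    # -> [25, 4]: ~25 iterations per execution, ~4 executions per build
```

With these made-up variance and cost figures, the sketch recommends repeating heavily at the cheap, noisy iteration level but only a handful of times per build, which is the intuition behind the abstract's claim that time efficiency comes from repeating where uncertainty actually arises.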