Robust benchmarking in noisy environments

We propose a benchmarking strategy that is robust in the presence of timer error, OS jitter, and other environmental fluctuations, and that is insensitive to the highly nonideal statistics produced by timing measurements. We construct a model that explains how these nonideal statistics can arise from environmental fluctuations and that justifies our proposed strategy. We implement this strategy in the BenchmarkTools Julia package, where it is used in production continuous integration (CI) pipelines for developing the Julia language and its ecosystem.
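
To make the CI use case concrete, the sketch below shows one way a regression check could be written against the BenchmarkTools API. The benchmark expression, the choice of the minimum as the location estimator, and the 5% tolerance are illustrative assumptions for this sketch, not prescriptions taken from the paper.

```julia
using BenchmarkTools

# Collect many timing samples of the expression. BenchmarkTools tunes the
# number of evaluations per sample so that individual samples are long
# enough to be resolved despite timer error.
baseline  = @benchmark sum(sin, $(rand(1000)))
candidate = @benchmark sum(sin, $(rand(1000)))

# Compare robust location estimates of the two sample distributions and
# flag the change as a regression, improvement, or invariant result.
result = judge(minimum(candidate), minimum(baseline); time_tolerance = 0.05)
println(result)
```

Comparing order statistics such as the minimum or median, rather than the mean, keeps the judgment insensitive to the heavy right tail that OS jitter and other environmental noise add to timing distributions.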
