Producing wrong data without doing anything obviously wrong!

This paper presents a surprising result: changing a seemingly innocuous aspect of an experimental setup can cause a systems researcher to draw wrong conclusions from an experiment. What appears to be an innocuous aspect in the experimental setup may in fact introduce a significant bias in an evaluation. This phenomenon is called measurement bias in the natural and social sciences. Our results demonstrate that measurement bias is significant and commonplace in computer system evaluation. By significant we mean that measurement bias can lead to a performance analysis that either over-states an effect or even yields an incorrect conclusion. By commonplace we mean that measurement bias occurs in all architectures that we tried (Pentium 4, Core 2, and m5 O3CPU), both compilers that we tried (gcc and Intel's C compiler), and most of the SPEC CPU2006 C programs. Thus, we cannot ignore measurement bias. Nevertheless, in a literature survey of 133 recent papers from ASPLOS, PACT, PLDI, and CGO, we determined that none of the papers with experimental results adequately consider measurement bias. Inspired by similar problems and their solutions in other sciences, we describe and demonstrate two methods, one for detecting (causal analysis) and one for avoiding (setup randomization) measurement bias.
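The setup-randomization idea can be illustrated with a minimal sketch. The fragment below is not the authors' tool; it assumes a hypothetical benchmark binary (./benchmark) and a hypothetical padding environment variable (PADDING) to vary one seemingly innocuous setup detail, the size of the UNIX environment, across repeated trials, and then reports statistics over the randomized runs rather than a single measurement.

```python
import os
import random
import statistics
import subprocess
import time

def run_trial(cmd, env_padding_bytes):
    """Run one benchmark trial with a padded UNIX environment.

    Growing the environment shifts where the program's stack starts,
    which is one of the seemingly innocuous setup details that can
    bias a measurement.
    """
    env = dict(os.environ)
    env["PADDING"] = "x" * env_padding_bytes  # hypothetical padding variable
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start

def randomized_measurement(cmd, trials=30, max_padding=4096):
    """Summarize runtime over trials with randomized environment sizes."""
    times = [run_trial(cmd, random.randrange(max_padding))
             for _ in range(trials)]
    return statistics.mean(times), statistics.stdev(times)

if __name__ == "__main__":
    mean, stdev = randomized_measurement(["./benchmark"])  # hypothetical binary
    print(f"mean runtime {mean:.3f}s +/- {stdev:.3f}s over randomized setups")
```

Comparing two systems this way (each measured over many randomized setups) guards against a conclusion that holds only under one particular, arbitrary configuration.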
