The ghost in the machine: Don't let it haunt your software performance measurements

This paper describes pitfalls, issues, and methodology for measuring software performance. Ideally, measurements should be performed and reported so that others can reproduce the results and confirm their validity. We aim to motivate scientists to apply the necessary rigor to the design and execution of their software performance measurements to achieve reliable results. Repeatability of experiments, comparability of reported results, and verifiability of claims based on such results can be achieved only when measurement and reporting procedures can be trusted. In short, this paper urges the reader to measure the right performance and to measure the performance right.
