Precise Computer Performance Comparisons Via Statistical Resampling Methods

Performance variability, stemming from nondeterministic hardware and software behaviors or from deterministic behaviors such as measurement bias, is a well-known phenomenon of computer systems that complicates the comparison of performance metrics. Conventional methods quantify the performance of different benchmarks with summary measures (such as the geometric mean) and compare computers without accounting for variability, which may lead to wrong conclusions. In this paper, we propose three resampling methods for performance evaluation and comparison: a randomization test for a general performance comparison between two computers, bootstrap confidence estimation, and an empirical distribution with a five-number summary for performance evaluation. The results show that 1) the randomization test substantially improves our chance of detecting a performance difference when that difference is not large; 2) bootstrap confidence estimation provides an accurate confidence interval for the performance comparison measure (e.g., the ratio of geometric means); and 3) when the difference is very small, a single test is often not enough to reveal the true nature of computer performance because of system variability. We therefore further propose using the empirical distribution to evaluate computer performance and a five-number summary to characterize it. We illustrate the results and conclusions through detailed Monte Carlo simulation studies and real examples. Results show that our methods are precise and robust even when two computers have very similar performance metrics.

Keywords— Performance of Systems; Performance attributes; Measurement, evaluation, modeling, simulation of multiple-processor systems; Experimental design
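The three resampling methods described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the paper's implementation: it assumes paired per-benchmark execution times for two machines, uses the ratio of geometric means as the comparison measure, a paired sign-flipping randomization test, a percentile bootstrap over benchmarks, and quartile-based five-number summaries. All function names and parameters here are illustrative.

```python
import math
import random
import statistics


def gm_ratio(a, b):
    # Comparison measure: ratio of geometric means of execution times.
    return statistics.geometric_mean(a) / statistics.geometric_mean(b)


def randomization_test(a, b, n_iter=10000, seed=0):
    """Paired randomization (permutation) test.

    Under the null hypothesis that the two computers perform alike,
    each benchmark's pair of times is exchangeable, so we count how
    often randomly swapping pairs yields a geometric-mean ratio at
    least as far from 1 (on the log scale) as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(math.log(gm_ratio(a, b)))
    extreme = 0
    for _ in range(n_iter):
        sa, sb = [], []
        for x, y in zip(a, b):
            if rng.random() < 0.5:   # swap the pair's labels at random
                x, y = y, x
            sa.append(x)
            sb.append(y)
        if abs(math.log(gm_ratio(sa, sb))) >= observed:
            extreme += 1
    return extreme / n_iter          # estimated two-sided p-value


def bootstrap_ci(a, b, n_boot=10000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the ratio of
    geometric means, resampling benchmark pairs with replacement."""
    rng = random.Random(seed)
    n = len(a)
    ratios = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        ratios.append(gm_ratio([a[i] for i in idx], [b[i] for i in idx]))
    ratios.sort()
    return (ratios[int(n_boot * alpha / 2)],
            ratios[int(n_boot * (1 - alpha / 2))])


def five_number_summary(xs):
    # Minimum, lower quartile, median, upper quartile, maximum.
    s = sorted(xs)
    q1, q2, q3 = statistics.quantiles(s, n=4)
    return s[0], q1, q2, q3, s[-1]
```

A typical use would run each benchmark repeatedly on both machines, feed the paired measurements to `randomization_test` to decide whether an observed difference is distinguishable from run-to-run variability, and report `bootstrap_ci` alongside the point estimate; when the interval straddles 1, the five-number summaries of each machine's ratios convey the spread that a single mean would hide.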
