R3: repeatability, reproducibility and rigor

Computer systems research spans sub-disciplines that include embedded systems, programming languages and compilers, networking, and operating systems. Our contention is that a number of structural factors inhibit high-quality systems research. We highlight some of the factors we have encountered in our own work and observed in published papers, and we propose solutions that could increase both the productivity of researchers and the quality of their output.
