Methodological Principles for Reproducible Performance Evaluation in Cloud Computing

SPEC RG Cloud Working Group

The rapid adoption and diversification of cloud computing technology make a sound experimental methodology for this domain increasingly important. This work investigates how to measure and report performance in the cloud, and how well the cloud research community is already doing it. We propose a set of eight methodological principles that combine best practices from related fields with concepts applicable only to clouds, and with new ideas about the time-accuracy trade-off. We show how these principles apply in practice through a use-case experiment: we analyze the ability of the newly released SPEC Cloud IaaS benchmark to follow the principles, and showcase real-world experimental studies in common cloud environments that meet them. Last, we report on a systematic literature review covering top conferences and journals in the field from 2012 to 2017, analyzing whether the practice of reporting cloud performance measurements follows the proposed eight principles. Worryingly, this systematic survey and the subsequent two rounds of human review reveal that few of the published studies follow the eight experimental principles. We conclude that, although these principles are simple and basic, the cloud community has yet to adopt them broadly to deliver sound measurements of cloud environments.
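As a purely illustrative sketch (not the paper's own method, whose eight principles are not enumerated here), the snippet below shows one common way the time-accuracy trade-off in cloud measurements can play out: repeating a measurement and reporting a percentile-bootstrap confidence interval, so that spending more time (more repetitions) yields a tighter interval (more accuracy). The function run_benchmark is a hypothetical stand-in for any actual benchmark invocation.

```python
# Illustrative sketch only: repeat a measurement, then report its mean together
# with a nonparametric (percentile bootstrap) confidence interval, so the reader
# sees the uncertainty, not just a single number. `run_benchmark` is a
# hypothetical placeholder, not part of any real benchmark suite.
import random
import statistics


def run_benchmark() -> float:
    """Hypothetical stand-in for one benchmark run (e.g., request latency in ms)."""
    return random.gauss(100.0, 15.0)


def bootstrap_ci(samples, iterations=10_000, confidence=0.95):
    """Percentile-bootstrap confidence interval for the mean of `samples`."""
    means = []
    for _ in range(iterations):
        resample = [random.choice(samples) for _ in samples]
        means.append(statistics.fmean(resample))
    means.sort()
    lower = means[int((1 - confidence) / 2 * iterations)]
    upper = means[int((1 + confidence) / 2 * iterations)]
    return lower, upper


if __name__ == "__main__":
    for repetitions in (5, 30, 100):  # more repetitions -> more experiment time
        samples = [run_benchmark() for _ in range(repetitions)]
        low, high = bootstrap_ci(samples)
        # Report the mean together with its uncertainty, not the mean alone.
        print(f"n={repetitions:3d}  mean={statistics.fmean(samples):6.1f} ms  "
              f"95% CI=({low:6.1f}, {high:6.1f}) ms  width={high - low:5.1f} ms")
```

Under these assumptions, increasing the repetition count narrows the reported interval, which is one concrete way to trade experiment time for measurement accuracy.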
