Is Big Data Performance Reproducible in Modern Cloud Networks?

Cloud practitioners and performance engineers have acknowledged performance variability as a problem for over a decade. Yet our survey of top systems conferences reveals that the research community regularly disregards variability when running experiments in the cloud. Focusing on networks, we assess the impact of variability on cloud-based big-data workloads by gathering traces from mainstream commercial clouds and private research clouds. Our dataset consists of millions of datapoints gathered while transferring over 9 petabytes of data. We characterize the network variability present in our data and show that, even though commercial cloud providers implement mechanisms for quality-of-service enforcement, variability still occurs and is even exacerbated by such mechanisms and by service-provider policies. We show that big-data workloads suffer from significant slowdowns and lack predictability and replicability, even when state-of-the-art experimentation techniques are used. We provide guidelines for practitioners to reduce the volatility of big-data performance, making experiments more repeatable.
