Scalability Bugs: When 100-Node Testing is Not Enough

We highlight the problem of scalability bugs, a new class of bugs that appear in "cloud-scale" distributed systems. Scalability bugs are latent bugs whose manifestation depends on cluster scale: their symptoms typically surface in large-scale deployments but not in small- or medium-scale ones. The standard practice for testing large distributed systems is to deploy them on a large number of machines ("real-scale testing"), which is difficult and expensive. New methods are needed to reduce developers' burden in finding, reproducing, and debugging scalability bugs. We propose "scale check," an approach that helps developers find and replay scalability bugs at real scales using only one machine while still achieving high accuracy (i.e., observed behavior similar to that of nodes deployed in real-scale testing).
