Automating Failure Testing Research at Internet Scale

Large-scale distributed systems must be built to anticipate and mitigate a variety of hardware and software failures. In order to build confidence that fault-tolerant systems are correctly implemented, Netflix (and similar enterprises) regularly run failure drills in which faults are deliberately injected in their production system. The combinatorial space of failure scenarios is too large to explore exhaustively. Existing failure testing approaches either randomly explore the space of potential failures randomly or exploit the "hunches" of domain experts to guide the search. Random strategies waste resources testing "uninteresting" faults, while programmer-guided approaches are only as good as human intuition and only scale with human effort. In this paper, we describe how we adapted and implemented a research prototype called lineage-driven fault injection (LDFI) to automate failure testing at Netflix. Along the way, we describe the challenges that arose adapting the LDFI model to the complex and dynamic realities of the Netflix architecture. We show how we implemented the adapted algorithm as a service atop the existing tracing and fault injection infrastructure, and present early results.

[1]  Sanjeev Khanna,et al.  Why and Where: A Characterization of Data Provenance , 2001, ICDT.

[2]  Thomas Ball,et al.  Finding and Reproducing Heisenbugs in Concurrent Programs , 2008, OSDI.

[3]  Farnam Jahanian,et al.  ORCHESTRA: A Fault Injection Environment for Distributed Systems , 1996 .

[4]  Brad Fitzpatrick,et al.  Distributed caching with memcached , 2004 .

[5]  Andrea C. Arpaci-Dusseau,et al.  FATE and DESTINI: A Framework for Cloud Recovery Testing , 2011, NSDI.

[6]  Leslie Lamport,et al.  Model Checking TLA+ Specifications , 1999, CHARME.

[7]  Joseph M. Hellerstein,et al.  Consistency Analysis in Bloom: a CALM and Collected Approach , 2011, CIDR.

[8]  Dawson R. Engler,et al.  Proceedings of the 5th Symposium on Operating Systems Design and Implementation Cmc: a Pragmatic Approach to Model Checking Real Code , 2022 .

[9]  Jacob A. Abraham,et al.  FERRARI: A Flexible Software-Based Fault and Error Injection System , 1995, IEEE Trans. Computers.

[10]  Fan Zhang,et al.  Use of Formal Methods at Amazon Web Services , 2014 .

[11]  Grigorios Tsoumakas,et al.  Multi-Label Classification: An Overview , 2007, Int. J. Data Warehous. Min..

[12]  Joseph M. Hellerstein,et al.  Lineage-driven Fault Injection , 2015, SIGMOD Conference.

[13]  Thomas F. Wenisch,et al.  The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services , 2014, OSDI.

[14]  Ruud C. M. de Rooij,et al.  Chaos Engineering , 2017, IEEE Software.

[15]  Prashant Malik,et al.  Cassandra: a decentralized structured storage system , 2010, OPSR.

[16]  Bertram Ludäscher,et al.  Towards Constraint Provenance Games , 2014, TAPP.

[17]  James Cheney,et al.  Provenance in Databases: Why, How, and Where , 2009, Found. Trends Databases.

[18]  Andreas Haeberlen,et al.  Answering why-not queries in software-defined networks with negative provenance , 2013, HotNets.

[19]  Haoxiang Lin,et al.  MODIST: Transparent Model Checking of Unmodified Distributed Systems , 2009, NSDI.

[20]  Dan Suciu,et al.  Tiresias: the database oracle for how-to queries , 2012, SIGMOD Conference.

[21]  Bertram Ludäscher,et al.  First-Order Provenance Games , 2013, In Search of Elegance in the Theory and Practice of Computation.

[22]  David Maier,et al.  Dedalus: Datalog in Time and Space , 2010, Datalog.

[23]  Jennifer Widom,et al.  Tracing the lineage of view data in a warehousing environment , 2000, TODS.

[24]  Farnam Jahanian,et al.  ORCHESTRA: a probing and fault injection environment for testing protocol implementations , 1996, Proceedings of IEEE International Computer Performance and Dependability Symposium.

[25]  Amin Vahdat,et al.  Life, death, and the critical transition: finding liveness bugs in systems code , 2007 .

[26]  Ion Stoica,et al.  Failure as a Service (FaaS): A Cloud Service for Large- Scale, Online Failure Drills , 2011 .

[27]  Donald Beaver,et al.  Dapper, a Large-Scale Distributed Systems Tracing Infrastructure , 2010 .

[28]  Dana Fisman,et al.  On Verifying Fault Tolerance of Distributed Protocols , 2008, TACAS.

[29]  Gerard J. Holzmann,et al.  The SPIN Model Checker - primer and reference manual , 2003 .