FlyMC: Highly Scalable Testing of Complex Interleavings in Distributed Systems

We present a fast and scalable testing approach for datacenter/cloud systems such as Cassandra, Hadoop, Spark, and ZooKeeper. What distinguishes our approach is its ability to overcome the path/state-space explosion problem when testing workloads with complex interleavings of messages and faults. We introduce three algorithms (state symmetry, event independence, and parallel flips) that collectively make our approach on average 16x (up to 78x) faster than other state-of-the-art solutions. We have integrated our techniques with 8 popular datacenter systems, successfully reproduced 12 old bugs, and found 10 new bugs, all without random walks or manual checkpoints.
