Minimizing Faulty Executions of Distributed Systems

When troubleshooting buggy executions of distributed systems, developers typically start by manually separating out events that are responsible for triggering the bug (signal) from those that are extraneous (noise). We present DEMi, a tool for automatically performing this minimization. We apply DEMi to buggy executions of two very different distributed systems, Raft and Spark, and find that it produces minimized executions that are between 1X and 4.6X the size of optimal executions.
