Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems

Large, production-quality distributed systems still fail periodically, sometimes catastrophically, with most or all users experiencing an outage or data loss. We present the results of a comprehensive study of 198 randomly selected, user-reported failures that occurred in Cassandra, HBase, Hadoop Distributed File System (HDFS), Hadoop MapReduce, and Redis, with the goal of understanding how one or more faults eventually evolve into a user-visible failure. We found that, from a testing point of view, almost all failures require only three or fewer nodes to reproduce, which is good news considering that these services typically run on a very large number of nodes. However, multiple input events are needed to trigger the failures, and the order among them matters. Finally, we found that the error logs of these systems typically contain sufficient data on both the errors and the input events that triggered the failure, enabling both diagnosis and reproduction of the production failures. We found that the majority of catastrophic failures could easily have been prevented by performing simple testing on error handling code, the last line of defense, even without an understanding of the software design. We extracted three simple rules from the bugs that led to some of the catastrophic failures, and developed a static checker, Aspirator, capable of locating these bugs. Over 30% of the catastrophic failures would have been prevented had Aspirator been used and the identified bugs fixed. Running Aspirator on the code of nine distributed systems located 143 bugs and bad practices that have since been fixed or confirmed by the developers.
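The abstract does not enumerate the three rules behind Aspirator, but catch blocks that are empty, that are left as an unfinished stub, or that abort the whole process on an overly general exception are representative of the kind of trivially inadequate error handling such a checker targets. The Java sketch below is an illustration only; the class, method, and exception names are hypothetical and are not taken from any of the studied systems.

```java
// Illustrative only: hypothetical handlers showing error-handling patterns
// that a simple static checker could flag. All names are made up.
import java.io.IOException;

public class ErrorHandlingAntiPatterns {

    // Pattern 1: the exception is silently swallowed, so the caller never
    // learns that the write failed.
    static void swallowedException(DataStore store, byte[] block) {
        try {
            store.write(block);
        } catch (IOException e) {
            // empty handler: the failure is ignored
        }
    }

    // Pattern 2: the handler is an unfinished stub left in production code.
    static void todoHandler(DataStore store, byte[] block) {
        try {
            store.write(block);
        } catch (IOException e) {
            // TODO: handle the write failure
        }
    }

    // Pattern 3: an overly broad catch that turns any error, however minor,
    // into a whole-process abort.
    static void overlyBroadAbort(DataStore store, byte[] block) {
        try {
            store.write(block);
        } catch (Throwable t) {
            System.exit(1); // brings down the node on any unexpected condition
        }
    }

    // Minimal hypothetical dependency so the sketch compiles on its own.
    interface DataStore {
        void write(byte[] block) throws IOException;
    }
}
```

A simple unit test that injects an IOException into handlers like these, as the paper advocates, would immediately expose that the error is dropped, deferred, or escalated into a node-wide crash.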
