An empirical study on crash recovery bugs in large-scale distributed systems

In large-scale distributed systems, node crashes are inevitable and can happen at any time. Distributed systems are therefore designed to tolerate node crashes through various crash recovery mechanisms, such as write-ahead logging in HBase and hinted handoff in Cassandra. However, faults in these mechanisms and their implementations can introduce intricate crash recovery bugs and lead to severe consequences. In this paper, we present CREB, the most comprehensive study to date of 103 Crash REcovery Bugs from four popular open-source distributed systems: ZooKeeper, Hadoop MapReduce, Cassandra, and HBase. For every studied bug, we analyze its root cause, triggering conditions, impact, and fix. Through this study, we obtain many interesting findings that can open up new research directions for combating crash recovery bugs.
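To make the write-ahead-logging idea mentioned above concrete, the following is a minimal, self-contained sketch of the pattern, not HBase's actual WAL implementation; the class and file names (MiniWal, mini-wal.log) are hypothetical. Each update is appended and fsynced to a log file before the in-memory state is mutated, so a crash between the two steps can be repaired by replaying the log at startup.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

// Minimal write-ahead-log sketch (illustrative only): log durably, then apply,
// and replay the log on restart to recover state lost in a crash.
public class MiniWal {
    private final Path logPath;
    private final Map<String, String> state = new HashMap<>();

    public MiniWal(Path logPath) throws IOException {
        this.logPath = logPath;
        recover();
    }

    // Recovery path: replay every record that reached durable storage.
    private void recover() throws IOException {
        if (!Files.exists(logPath)) return;
        for (String line : Files.readAllLines(logPath)) {
            String[] kv = line.split("\t", 2);
            if (kv.length == 2) state.put(kv[0], kv[1]);
        }
    }

    // Write path: append and force the record to disk before mutating memory.
    public void put(String key, String value) throws IOException {
        try (FileOutputStream fos = new FileOutputStream(logPath.toFile(), true)) {
            fos.write((key + "\t" + value + "\n").getBytes("UTF-8"));
            fos.getFD().sync(); // durability point: the update survives a crash from here on
        }
        state.put(key, value);
    }

    public String get(String key) {
        return state.get(key);
    }

    public static void main(String[] args) throws IOException {
        MiniWal wal = new MiniWal(Paths.get("mini-wal.log"));
        wal.put("row1", "v1");
        System.out.println(wal.get("row1")); // same value is recovered after a restart
    }
}
```

Even in a sketch this small, the crash recovery logic is a distinct code path that only runs after a failure, which is exactly where the bugs studied in this paper tend to hide.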
