Paper information - An empirical study on crash recovery bugs in large-scale distributed systems

An empirical study on crash recovery bugs in large-scale distributed systems

In large-scale distributed systems, node crashes are inevitable and can happen at any time. Distributed systems are therefore usually designed to tolerate node crashes through crash recovery mechanisms, such as write-ahead logging in HBase and hinted handoffs in Cassandra. However, faults in these mechanisms and their implementations can introduce intricate crash recovery bugs and lead to severe consequences. In this paper, we present CREB, the most comprehensive study of 103 Crash REcovery Bugs from four popular open-source distributed systems: ZooKeeper, Hadoop MapReduce, Cassandra, and HBase. For each studied bug, we analyze its root cause, triggering conditions, impact, and fix. Through this study, we obtain many interesting findings that can open up new research directions for combating crash recovery bugs.
