Paper Information - Correlated Crash Vulnerabilities

Correlated Crash Vulnerabilities

Modern distributed storage systems employ complex protocols to update replicated data. In this paper, we study whether such update protocols work correctly in the presence of correlated crashes. We find that the correctness of such protocols hinges on how local file-system state is updated by each replica in the system. We build PACE, a framework that systematically generates and explores persistent states that can occur in a distributed execution. PACE uses a set of generic rules to effectively prune the state space, reducing checking time from days to hours in some cases. We apply PACE to eight widely used distributed storage systems to find correlated crash vulnerabilities, i.e., problems in the update protocol that lead to user-level guarantee violations. PACE finds a total of 26 vulnerabilities across eight systems, many of which lead to severe consequences such as data loss, corrupted data, or unavailable clusters.
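
To make the idea of exploring persistent states concrete, the following is a minimal sketch and not PACE's actual algorithm. It assumes, purely for illustration, that each node persists its file-system operations strictly in issue order, so a crash leaves a prefix of that node's operations on disk; real file systems may reorder or partially persist updates, and this sketch ignores that. The names `prefixes`, `correlated_crash_states`, and `durable_on_majority` are hypothetical and do not come from the paper.

```python
from itertools import product


def prefixes(ops):
    """Possible on-disk states of one node, assuming in-order persistence
    (an assumption of this sketch): every prefix of its operation list."""
    return [tuple(ops[:i]) for i in range(len(ops) + 1)]


def correlated_crash_states(node_ops):
    """Yield every global persistent state in which all nodes crash together,
    each at an independently chosen point in its own operation sequence."""
    names = sorted(node_ops)
    for combo in product(*(prefixes(node_ops[name]) for name in names)):
        yield dict(zip(names, combo))


def durable_on_majority(state, op):
    """Hypothetical user-level guarantee: `op` is persisted on a strict
    majority of the nodes in the given global state."""
    holders = sum(1 for persisted in state.values() if op in persisted)
    return holders > len(state) // 2


if __name__ == "__main__":
    # Three replicas that each persist a single update.
    node_ops = {"n1": ["append x"], "n2": ["append x"], "n3": ["append x"]}
    states = list(correlated_crash_states(node_ops))
    violating = [s for s in states if not durable_on_majority(s, "append x")]
    print(len(states), "states,", len(violating), "without a durable majority")
```

The number of global states is the product of the per-node prefix counts, so it grows exponentially with the number of nodes. That growth is why pruning the state space, as the abstract describes, matters for keeping checking time practical.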
