Paper information - Detecting failures in distributed systems with the Falcon spy network

A common way for a distributed system to tolerate crashes is to explicitly detect them and then recover from them. Interestingly, detection can take much longer than recovery, as a result of many advances in recovery techniques, making failure detection the dominant factor in these systems' unavailability when a crash occurs. This paper presents the design, implementation, and evaluation of Falcon, a failure detector with several features. First, Falcon's common-case detection time is sub-second, which keeps unavailability low. Second, Falcon is reliable: it never reports a process as down when it is actually up. Third, Falcon sometimes kills to achieve reliable detection but aims to kill the smallest needed component. Falcon achieves these features by coordinating a network of spies, each monitoring a layer of the system. Falcon's main cost is a small amount of platform-specific logic. Falcon is thus the first failure detector that is fast, reliable, and viable. As such, it could change the way that a class of distributed systems is built.
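The layered idea in the abstract can be illustrated with a small sketch. It is not Falcon's implementation: the layer names (`app`, `os`, `vmm`), the `Layer` type, and the `detect` and `make_layer` helpers are all hypothetical, and the escalation rule is a simplifying assumption. The sketch only shows the two properties the abstract states: a target is reported down only once that is certain, and when a kill is needed it is aimed at the smallest component that a responsive layer can reach.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Status(Enum):
    UP = "up"
    DOWN = "down"
    UNKNOWN = "unknown"  # the spy did not answer, so nothing is certain


@dataclass
class Layer:
    """One monitored layer (hypothetical model, not Falcon's API)."""
    name: str
    probe: Callable[[], Status]  # what this layer's spy reports about the layer
    kill: Callable[[], None]     # kills this layer; invoked on behalf of the enclosing layer


def detect(layers: List[Layer]) -> Status:
    """Report the status of layers[0], the monitored target.

    `layers` is ordered from the smallest component to the largest,
    and each layer runs inside the one that follows it.
    """
    for i, layer in enumerate(layers):
        status = layer.probe()
        if status is Status.DOWN:
            # The target cannot be up if it, or a layer enclosing it, is down.
            return Status.DOWN
        if status is Status.UP:
            if i == 0:
                return Status.UP
            # Every smaller layer was silent and this one is alive: kill the
            # layer directly inside it, so that DOWN becomes a true report.
            layers[i - 1].kill()
            return Status.DOWN
        # UNKNOWN: escalate to the next, larger layer.
    # No spy answered; stay undecided rather than risk a false DOWN.
    return Status.UNKNOWN


def make_layer(name: str, status: Status, killed: List[str]) -> Layer:
    """Build a stub layer with a fixed probe result that records kills."""
    return Layer(name, probe=lambda: status, kill=lambda: killed.append(name))


killed: List[str] = []
layers = [
    make_layer("app", Status.UNKNOWN, killed),
    make_layer("os", Status.UNKNOWN, killed),
    make_layer("vmm", Status.UP, killed),
]
print(detect(layers), killed)
```

In this run the `app` and `os` spies are silent while the `vmm` spy answers, so the sketch kills only `os` and then reports the target as down. If no spy answers at all, it returns `UNKNOWN` instead of guessing.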
