Detecting failures in distributed systems with the Falcon spy network

A common way for a distributed system to tolerate crashes is to explicitly detect them and then recover from them. Because recovery techniques have advanced considerably, detection can now take much longer than recovery, which makes failure detection the dominant factor in these systems' unavailability when a crash occurs. This paper presents the design, implementation, and evaluation of Falcon, a failure detector with three key features. First, Falcon's common-case detection time is sub-second, which keeps unavailability low. Second, Falcon is reliable: it never reports a process as down when it is actually up. Third, Falcon sometimes kills a component to make detection reliable, but it aims to kill the smallest component necessary. Falcon achieves these features by coordinating a network of spies, each monitoring a layer of the system. Falcon's main cost is a small amount of platform-specific logic. Falcon is thus the first failure detector that is fast, reliable, and viable. As such, it could change the way that a class of distributed systems is built.
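The layered idea in the abstract can be illustrated with a minimal sketch. It is not Falcon's actual interface or algorithm: the `Spy` class, the `probe` and `kill` callbacks, the three-valued `Status`, and the layer names are all assumptions made for illustration. The sketch only shows how a report of "down" can be kept truthful by killing the smallest component whose state cannot otherwise be confirmed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Status(Enum):
    UP = "up"
    DOWN = "down"
    UNKNOWN = "unknown"  # the spy gave no usable answer (assumed third state)


@dataclass
class Spy:
    """Hypothetical spy for one layer; not Falcon's real interface."""
    layer: str
    probe: Callable[[], Status]  # observe the component this spy watches
    kill: Callable[[], None]     # forcibly stop the component this spy watches


def detect(spies: List[Spy]) -> Status:
    """Decide the status of the innermost component.

    `spies` is ordered from the innermost layer (the monitored process)
    outward; spies[i + 1] watches the layer that hosts spies[i]'s layer.
    """
    for i, spy in enumerate(spies):
        status = spy.probe()
        if status is Status.DOWN:
            # A layer that is down takes every layer inside it down too.
            return Status.DOWN
        if status is Status.UP:
            if i == 0:
                return Status.UP
            # This layer is alive, but the layer just inside it gave no
            # answer. Kill that inner layer (the smallest component that
            # resolves the doubt) so that reporting DOWN is truthful.
            spies[i - 1].kill()
            return Status.DOWN
    # No layer could answer; this sketch does not guess.
    return Status.UNKNOWN


if __name__ == "__main__":
    killed: List[str] = []
    spies = [
        Spy("app", lambda: Status.UNKNOWN, lambda: killed.append("app")),
        Spy("os", lambda: Status.UP, lambda: killed.append("os")),
        Spy("vmm", lambda: Status.UP, lambda: killed.append("vmm")),
    ]
    print(detect(spies), killed)
```

In this sketch a healthy outer layer never causes a larger component to be killed than the one directly inside it, which mirrors the abstract's goal of killing the smallest needed component. How the real system orders its probes, bounds their latency, and handles the case where no layer responds is not specified by the abstract.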
