The failure detector abstraction

A failure detector is a fundamental abstraction in distributed computing. This article surveys the abstraction along two dimensions. First, we study failure detectors as building blocks that simplify the design of reliable distributed algorithms. In particular, we illustrate how failure detectors can factor out the timing assumptions needed to detect failures in distributed agreement algorithms. Second, we study failure detectors as computability benchmarks. That is, we survey the weakest failure detector question and illustrate how failure detectors can be used to classify problems. We also highlight some limitations of the failure detector abstraction along each of these dimensions.
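To make the "factoring out timing assumptions" idea concrete, the following is a minimal sketch and is not taken from the article. It assumes a heartbeat-based detector with adaptive timeouts; the class `TimeoutFailureDetector`, the function `pick_leader`, and all parameter names are hypothetical. All timing logic lives inside the detector, while the algorithm on top only queries the set of suspected processes.

```python
class TimeoutFailureDetector:
    """Hypothetical heartbeat-based detector: suspects a process whose last
    heartbeat is older than that process's current timeout."""

    def __init__(self, processes, initial_timeout=1.0, backoff=2.0):
        self.timeout = {p: initial_timeout for p in processes}
        self.last_heartbeat = {p: 0.0 for p in processes}
        self.suspected = set()
        self.backoff = backoff

    def on_heartbeat(self, process, now):
        """Record a heartbeat; a heartbeat from a suspected process reveals a
        false suspicion, so retract it and enlarge that process's timeout."""
        self.last_heartbeat[process] = now
        if process in self.suspected:
            self.suspected.discard(process)
            self.timeout[process] *= self.backoff

    def check(self, now):
        """Return the set of processes suspected at time `now`."""
        for process, last in self.last_heartbeat.items():
            if now - last > self.timeout[process]:
                self.suspected.add(process)
        return set(self.suspected)


def pick_leader(detector, now):
    """Timing-free client code: trust the smallest unsuspected process id.
    Returns None if every process is currently suspected."""
    suspected = detector.check(now)
    trusted = [p for p in detector.timeout if p not in suspected]
    return min(trusted) if trusted else None


if __name__ == "__main__":
    fd = TimeoutFailureDetector(["p1", "p2"], initial_timeout=1.0)
    fd.on_heartbeat("p1", 1.8)
    print(fd.check(2.0))         # p2 has been silent for longer than its timeout
    fd.on_heartbeat("p2", 2.5)   # late heartbeat: suspicion retracted, timeout doubled
    print(pick_leader(fd, 2.6))
```

Note that `pick_leader` never mentions clocks or timeouts itself. Whether such a detector eventually stops making mistakes depends on synchrony assumptions about the underlying system, which this sketch does not model.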
