Accrual Failure Detectors

Failure detection is a fundamental building block for ensuring fault tolerance in distributed systems. For this reason, many have advocated that failure detection be provided as some form of service [1, 2, 3, 4, 5], similar to IP address lookup (DNS) or time synchronization (e.g., NTP). Unfortunately, in spite of important technical breakthroughs, this view has met with little success so far. We believe that one of the main reasons is the conventional binary interaction (i.e., trust vs. suspect), which makes it difficult to meet the requirements of several distributed applications running simultaneously. We therefore advocate a different abstraction that helps decouple application requirements from issues related to the underlying system. It is well known that there is an inherent tradeoff between (1) conservative failure detection (i.e., reducing the risk of wrongly suspecting a running process) and (2) aggressive failure detection (i.e., quickly detecting the occurrence of a real crash). There exists a continuum of valid choices between these two extremes, and what defines an appropriate choice is strongly related to application requirements.
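The tradeoff between conservative and aggressive detection can be made concrete with a small sketch. This is an illustration only and is not taken from the paper: it assumes a plain fixed-timeout heartbeat monitor, and the function name `evaluate_timeout`, the sample arrival times, and the crash time are all hypothetical. For a given timeout it counts how often a live process would have been wrongly suspected and how long the real crash takes to be detected.

```python
from typing import List, Tuple


def evaluate_timeout(arrivals_ms: List[int], crash_ms: int, timeout_ms: int) -> Tuple[int, int]:
    """Return (wrong_suspicions, detection_delay_ms) for a fixed-timeout monitor.

    Assumptions (hypothetical model, for illustration only):
    - arrivals_ms holds the sorted arrival times of heartbeats at the monitor;
    - the monitored process crashes at crash_ms, after its last heartbeat arrived;
    - the monitor suspects the process once timeout_ms has elapsed since the
      most recent heartbeat, and trusts it again when a new heartbeat arrives.
    """
    wrong_suspicions = 0
    for previous, current in zip(arrivals_ms, arrivals_ms[1:]):
        # A gap longer than the timeout means the monitor suspected a process
        # that was still running, then had to revise its opinion.
        if current - previous > timeout_ms:
            wrong_suspicions += 1

    # The final suspicion starts timeout_ms after the last heartbeat.
    suspected_at = arrivals_ms[-1] + timeout_ms
    # If the monitor was already suspecting at the crash instant, count zero delay.
    detection_delay_ms = max(0, suspected_at - crash_ms)
    return wrong_suspicions, detection_delay_ms


# Hypothetical trace: heartbeats sent roughly once per second with jitter.
ARRIVALS_MS = [1000, 2100, 3000, 4600, 5000, 6200]
CRASH_MS = 6500

for timeout in (1000, 1500, 2000):
    wrong, delay = evaluate_timeout(ARRIVALS_MS, CRASH_MS, timeout)
    print(f"timeout={timeout} ms: wrong suspicions={wrong}, detection delay={delay} ms")
```

On this trace, shortening the timeout lowers the detection delay but raises the number of wrong suspicions, and lengthening it does the opposite. No single timeout suits every application sharing the same detector, which is the difficulty the abstract attributes to the binary trust/suspect interaction.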

[1]  Gregor von Laszewski, et al. A fault detection service for wide area distributed computations, 1998, Proceedings. The Seventh International Symposium on High Performance Distributed Computing (Cat. No.98TB100244).

[2]  Keith Marzullo, et al. Election Vs. Consensus in Asynchronous Systems, 1995.

[3]  Michael Ben-Or, et al. Another advantage of free choice (Extended Abstract): Completely asynchronous agreement protocols, 1983, PODC '83.

[4]  Michel Raynal, et al. An adaptive failure detection protocol, 2001, Proceedings 2001 Pacific Rim International Symposium on Dependable Computing.

[5]  Sam Toueg, et al. The weakest failure detector for solving consensus, 1996, JACM.

[6]  Xavier Défago, et al. Group communication based on standard interfaces, 2003, Second IEEE International Symposium on Network Computing and Applications, 2003. NCA 2003.

[7]  Pierre Sens, et al. Implementation and performance evaluation of an adaptable failure detector, 2002, Proceedings International Conference on Dependable Systems and Networks.

[8]  Anne-Marie Kermarrec, et al. Probabilistic Reliable Dissemination in Large-Scale Systems, 2003, IEEE Trans. Parallel Distributed Syst.

[9]  Péter Urbán, et al. Performance comparison of a rotating coordinator and a leader based consensus algorithm, 2004, Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004.

[10]  Roy Friedman. Fuzzy group membership, 2003.

[11]  Naohiro Hayashibara, et al. Flexible Failure Detection with κ-FD, 2004.

[12]  Nancy A. Lynch, et al. Consensus in the presence of partial synchrony, 1988, JACM.

[13]  Idit Keidar, et al. Moshe: A group membership service for WANs, 2002, TOCS.

[14]  V. Jacobson, et al. Congestion avoidance and control, 1988, CCRV.

[15]  Xavier Défago, et al. Semi-passive replication, 1998, Proceedings Seventeenth IEEE Symposium on Reliable Distributed Systems (Cat. No.98CB36281).

[16]  Marcos K. Aguilera, et al. On the quality of service of failure detectors based on control theory, 2006, 20th International Conference on Advanced Information Networking and Applications - Volume 1 (AINA'06).

[17]  D Xavier, et al. On the Design of a Failure Detection Service for Large-Scale Distributed Systems, 2003.

[18]  Achour Mostéfaoui, et al. Crash-resilient time-free eventual leadership, 2004, Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004.

[19]  Leslie Lamport, et al. Reaching Agreement in the Presence of Faults, 1980, JACM.

[20]  Achour Mostéfaoui, et al. Asynchronous implementation of failure detectors, 2003, 2003 International Conference on Dependable Systems and Networks, 2003. Proceedings.

[21]  Paulo Veríssimo, et al. Using Tailored Failure Suspectors to Support Distributed Cooperative Applications, 1995, Parallel and Distributed Computing and Systems.

[22]  Péter Urbán, et al. Performance Comparison Between the Paxos and Chandra-Toueg Consensus Algorithms, 2002.

[23]  Naohiro Hayashibara, et al. The φ Accrual Failure Detector, 2004.

[24]  Rachid Guerraoui, et al. The Generic Consensus Service, 2001, IEEE Trans. Software Eng.

[25]  Anne-Marie Kermarrec, et al. Peer-to-Peer Membership Management for Gossip-Based Protocols, 2003, IEEE Trans. Computers.

[26]  Edmundo Roberto Mauro Madeira, et al. ADAPTATION - Algorithms to Adaptive Fault Monitoring and their implementation on CORBA, 2001, Proceedings 3rd International Symposium on Distributed Objects and Applications.

[27]  林原 尚浩. Accrual failure detectors, 2004.

[28]  Bernadette Charron-Bost, et al. Solving Problems in the Presence of Process Crashes and Lossy Links, 1996.

[29]  Péter Urbán, et al. Definition and specification of accrual failure detectors, 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).

[30]  Xavier Défago, et al. Impact of a failure detection mechanism on the performance of consensus, 2001, Proceedings 2001 Pacific Rim International Symposium on Dependable Computing.

[31]  Naohiro Hayashibara, et al. Failure detectors for large-scale distributed systems, 2002, 21st IEEE Symposium on Reliable Distributed Systems, 2002. Proceedings.

[32]  Marcos K. Aguilera, et al. Using the Heartbeat Failure Detector for Quiescent Reliable Communication and Consensus in Partitionable Networks, 1999, Theor. Comput. Sci.

[33]  Xavier Défago, et al. Optimization techniques for replicating CORBA objects, 1999, 1999 Proceedings. Fourth International Workshop on Object-Oriented Real-Time Dependable Systems.

[34]  Nancy A. Lynch, et al. Impossibility of distributed consensus with one faulty process, 1985, JACM.

[35]  Sam Toueg, et al. Unreliable failure detectors for reliable distributed systems, 1996, JACM.

[36]  Pierre Sens, et al. Performance analysis of a hierarchical failure detector, 2003, 2003 International Conference on Dependable Systems and Networks, 2003. Proceedings.

[37]  Rachid Guerraoui, et al. Failure detectors as first class objects, 1999, Proceedings of the International Symposium on Distributed Objects and Applications.

[38]  Steven Tuecke, et al. The Anatomy of the Grid, 2003.

[39]  Robbert van Renesse, et al. A Gossip-Style Failure Detection Service, 2009.

[40]  Jorge C. A. de Figueiredo, et al. How bad are wrong suspicions? towards adaptive distributed protocols, 2003, 2003 International Conference on Dependable Systems and Networks, 2003. Proceedings.