Design of a notification system for the φ accrual failure detector

It is widely recognized that distributed systems would greatly benefit from a generic failure detection service. However, several issues must be addressed before such a service can be implemented. Traditionally, failure detectors and failure detection services provide a list of the processes they currently suspect, and most mechanisms for propagating failure information were designed for such traditional detectors. Recently, a family of failure detectors has been proposed that instead provide a degree of confidence that a given process has actually crashed, called the suspicion level. The φ failure detector is an implementation of this notion of accrual failure detectors. In this paper, we highlight the problem of propagating information about crashed or suspected processes with the φ failure detector. Since the suspicion level is a continuous value, existing propagation mechanisms are not appropriate for this type of failure detector. We therefore propose a notification system that can efficiently propagate suspicion levels. It delivers this information to the appropriate receivers, so processes in distributed applications that use the proposed system together with the φ failure detector do not need to implement failure detection themselves.
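To make the contrast with list-based detectors concrete, the following is a minimal sketch, not the notification system proposed in the paper. It assumes the usual definition of the suspicion level, φ = -log10(P_later(t)), where P_later(t) is the probability that a heartbeat arrives more than t after the previous one, and it assumes heartbeat inter-arrival times are normally distributed. The names `phi`, `ThresholdNotifier`, `subscribe`, and `update` are hypothetical and chosen only for illustration.

```python
import math
import sys
from statistics import mean, stdev


def phi(elapsed, intervals):
    """Suspicion level for a process whose last heartbeat was `elapsed` ago.

    Assumption: inter-arrival times in `intervals` follow a normal distribution.
    """
    mu = mean(intervals)
    sigma = max(stdev(intervals), 1e-6)  # guard against zero variance
    # Upper-tail probability of the normal distribution, computed with erfc
    # so that small probabilities do not round to zero too early.
    p_later = 0.5 * math.erfc((elapsed - mu) / (sigma * math.sqrt(2)))
    # Clamp to the smallest positive float to avoid log10(0).
    return -math.log10(max(p_later, sys.float_info.min))


class ThresholdNotifier:
    """Hypothetical registry: each receiver picks its own threshold on phi."""

    def __init__(self):
        self._subs = []

    def subscribe(self, threshold, callback):
        # `suspected` remembers, per monitored process, the last state reported.
        self._subs.append(
            {"threshold": threshold, "callback": callback, "suspected": {}}
        )

    def update(self, process_id, level):
        # Notify a receiver only when the level crosses that receiver's threshold.
        for sub in self._subs:
            now_suspected = level >= sub["threshold"]
            was_suspected = sub["suspected"].get(process_id, False)
            if now_suspected != was_suspected:
                sub["suspected"][process_id] = now_suspected
                sub["callback"](process_id, level, now_suspected)
```

In this sketch, a receiver that can tolerate wrong suspicions might subscribe with a low threshold, while a receiver for which a wrong suspicion is costly might subscribe with a high one. Both are driven by the same stream of suspicion levels, which a single shared list of suspected processes cannot express.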
