On the fly estimation of the processes that are alive/crashed in an asynchronous message-passing system

It is well known that, in an asynchronous system where processes are prone to crash, it is impossible to design a protocol that provides each process with the set of processes that are currently alive. This impossibility stems from the fact that a crashed process cannot be distinguished from a process that is very slow, or whose communications are very slow. Nevertheless, designing protocols that provide the processes with good approximations of the set of processes that are currently alive remains a real challenge in fault-tolerant distributed computing. This paper proposes such a protocol. To that end, it considers a realistic computation model in which the processes are provided with non-synchronized local clocks and a function alpha(). That function takes a local duration as a parameter and returns an integer that is an estimate of the number of processes that can crash during that duration. A simulation-based experimental evaluation of the protocol is also presented. The experiments show that the protocol is practically relevant.
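To make the role of alpha() concrete, the following is a minimal sketch under stated assumptions; it is not the protocol proposed in the paper. It assumes a hypothetical linear crash-rate model for alpha() (the names `make_alpha`, `crashes_per_second`, and `estimate_crashed` are illustrative and do not come from the source) and a simple heuristic in which a process suspects the peers that have been silent longest, but never more peers than alpha() allows for that silence duration. All durations are measured on the observer's own local clock, so no clock synchronization is assumed.

```python
import math
from typing import Callable, Dict, Set


def make_alpha(crashes_per_second: float, n: int) -> Callable[[float], int]:
    """Build a hypothetical alpha(): a linear crash-rate model, capped at n - 1.

    Assumption for illustration only: the source does not say how alpha()
    is defined, only that it maps a local duration to an integer estimate.
    """
    def alpha(duration: float) -> int:
        if duration <= 0:
            return 0
        return min(n - 1, math.ceil(crashes_per_second * duration))
    return alpha


def estimate_crashed(last_heard: Dict[str, float], now: float,
                     alpha: Callable[[float], int]) -> Set[str]:
    """Illustrative heuristic (not the paper's protocol).

    last_heard maps each peer to the local-clock time of its latest message.
    The peer with the i-th longest silence is suspected only if alpha()
    says at least i processes could have crashed during that silence.
    """
    silences = sorted(((now - t, p) for p, t in last_heard.items()),
                      reverse=True)
    suspected: Set[str] = set()
    for rank, (silence, peer) in enumerate(silences, start=1):
        if alpha(silence) >= rank:
            suspected.add(peer)
    return suspected


if __name__ == "__main__":
    alpha = make_alpha(crashes_per_second=0.1, n=4)
    last_heard = {"p1": 99.0, "p2": 70.0, "p3": 98.0}
    print(estimate_crashed(last_heard, now=100.0, alpha=alpha))
```

In this sketch a peer that has been silent for a short local duration is not suspected, because the assumed alpha() does not allow enough crashes in that window to account for it; only long silences are charged against the crash budget.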
