Unreliable failure detectors for asynchronous systems (preliminary version)

Tushar Deepak Chandra and Sam Toueg
Department of Computer Science, Upson Hall, Cornell University, Ithaca, New York 14853
chandra, sam@cs.cornell.edu

We introduce the concept of failure detectors for asynchronous systems with crash failures. We show that even with a failure detector that makes an unbounded and possibly infinite number of mistakes, we can solve the Consensus and Atomic Broadcast problems, two fundamental paradigms of fault-tolerant computing that are known to be unsolvable in asynchronous systems. We characterize failure detectors in terms of their completeness and accuracy properties, and classify them in a hierarchy ordered by a reducibility relation. We present matching upper and lower bounds on the fault-tolerance of solutions to Consensus and Atomic Broadcast for members of this hierarchy.
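To make the idea of an unreliable failure detector concrete, the following is a minimal sketch and not the authors' construction. It assumes a heartbeat-and-timeout design with a logical clock; the names `TimeoutDetector`, `heartbeat`, `tick`, and `run_demo` are hypothetical. The detector may wrongly suspect a slow process (a mistake against accuracy), retracts the suspicion when a late heartbeat arrives, and keeps suspecting a process that has really crashed (completeness).

```python
class TimeoutDetector:
    """Illustrative detector: suspects a peer whose heartbeat is overdue."""

    def __init__(self, peers, initial_timeout=2):
        self.timeout = {p: initial_timeout for p in peers}
        self.last_heard = {p: 0 for p in peers}
        self.suspected = set()
        self.mistakes = 0  # number of suspicions later found to be wrong

    def heartbeat(self, peer, now):
        """Record a heartbeat; retract a false suspicion and wait longer next time."""
        self.last_heard[peer] = now
        if peer in self.suspected:
            self.suspected.discard(peer)
            self.timeout[peer] += 1
            self.mistakes += 1

    def tick(self, now):
        """Suspect every peer whose last heartbeat is older than its timeout."""
        for peer, heard in self.last_heard.items():
            if now - heard > self.timeout[peer]:
                self.suspected.add(peer)
        return set(self.suspected)


def run_demo():
    # Assumed scenario: "p" is correct but slow (one heartbeat every 4 ticks),
    # "q" sends two heartbeats and then crashes.
    heartbeats = {"p": {1, 5, 9, 13, 17}, "q": {1, 2}}
    detector = TimeoutDetector(["p", "q"], initial_timeout=2)
    for now in range(1, 21):
        for peer, times in heartbeats.items():
            if now in times:
                detector.heartbeat(peer, now)
        detector.tick(now)
    return detector


if __name__ == "__main__":
    d = run_demo()
    print(sorted(d.suspected), d.mistakes)
```

In this run the slow process is wrongly suspected once before its timeout grows enough, while the crashed process stays suspected. In a fully asynchronous system no timeout is guaranteed to be large enough, which is why such a detector can make an unbounded number of mistakes.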
