Unreliable failure detectors for asynchronous distributed systems

Distributed computing is very attractive, but it comes with new problems: information loss, overflow, and breakdowns. These problems are often neglected, yet they are fundamental: it has been shown that Consensus (a basic problem which requires the processes to agree on a common value) is unsolvable in a realistic computing model, i.e. a completely asynchronous one with possible crash failures [FLP85]. Intuitively, in an asynchronous environment, a process cannot tell whether a component has crashed or is merely very slow. Several approaches have been designed to “bypass” this impossibility. One of them is self-stabilization, studied at LaRIA, which deals with transient faults. The principle is to design algorithms that can start from any initial state and eventually behave according to their specification. Snap-stabilization is stronger: starting from any initial state, the algorithm always behaves according to its specification. The first snap-stabilizing algorithms were designed at LaRIA. Another approach, the one we study here, copes with permanent (crash) failures. Ideally, a black box would be attached to each process to indicate precisely which failures have occurred in the network. This black box is called a failure detector. However, the result of [FLP85] implies that such a perfect failure detector cannot be implemented. That is why Chandra and Toueg introduce in [CHT96] the notion of unreliable failure detectors. Even though such detectors are still impossible to implement, in practice this approach makes it possible to implement semi-algorithms. In theory, it also makes it possible to introduce a hierarchy of the unreliable
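To make the idea of an unreliable failure detector concrete, here is a minimal sketch of a timeout-based detector in Python. It is an illustration only, not the construction of [CHT96]: the class name `HeartbeatDetector`, its fields, and the timeout values are all hypothetical, and time is passed in explicitly so the example is deterministic. The detector suspects a peer whose last heartbeat is too old, which is exactly where it can be wrong: a slow but live process looks the same as a crashed one. When a suspected peer is heard from again, the detector withdraws the suspicion and doubles that peer's timeout, a common heuristic that only stops false suspicions under the additional assumption that message delays are eventually bounded.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatDetector:
    """Illustrative timeout-based failure detector (all names are hypothetical)."""
    timeout: float = 1.0   # initial timeout guess, an assumption of this sketch
    backoff: float = 2.0   # growth factor applied after each false suspicion
    last_seen: dict = field(default_factory=dict)  # peer -> time of last heartbeat
    timeouts: dict = field(default_factory=dict)   # peer -> current timeout
    suspected: set = field(default_factory=set)    # peers currently suspected

    def register(self, peer, now):
        """Start monitoring a peer as of time `now`."""
        self.last_seen[peer] = now
        self.timeouts[peer] = self.timeout

    def heartbeat(self, peer, now):
        """Record a heartbeat; one from a suspected peer reveals a mistake."""
        self.last_seen[peer] = now
        if peer in self.suspected:
            self.suspected.discard(peer)
            self.timeouts[peer] *= self.backoff  # be more patient next time

    def check(self, now):
        """Suspect every peer whose last heartbeat is older than its timeout."""
        for peer, seen in self.last_seen.items():
            if now - seen > self.timeouts[peer]:
                self.suspected.add(peer)
        return set(self.suspected)


fd = HeartbeatDetector(timeout=1.0)
fd.register("p", now=0.0)          # p is alive but slow
fd.register("q", now=0.0)          # q crashes and never sends a heartbeat
print(sorted(fd.check(now=1.5)))   # ['p', 'q']: p is wrongly suspected
fd.heartbeat("p", now=2.0)         # the mistake on p is discovered
print(sorted(fd.check(now=3.0)))   # ['q']
```

In this run the detector is wrong about `p` at time 1.5 and corrects itself at time 2.0, while the crashed `q` stays suspected; no finite timeout rules out such mistakes in a fully asynchronous system.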

[1]  Sam Toueg, et al. The weakest failure detector for solving consensus, 1992, PODC '92.

[2]  Cynthia Dwork, et al. Randomization in Byzantine Agreement, 1989, Adv. Comput. Res.

[3]  C. Mohan, et al. Method for distributed transaction commit and recovery using Byzantine Agreement within clusters of processors, 1983, PODC '83.

[4]  Nancy A. Lynch, et al. Impossibility of distributed consensus with one faulty process, 1983, PODS '83.

[5]  Sung-Hoon Park. The Weakest Failure Detector for Solving Election Problems in Asynchronous Distributed Systems, 2002, EurAsia-ICT.

[6]  Kenneth P. Birman, et al. Using process groups to implement failure detection in asynchronous environments, 1991, PODC '91.

[7]  Sape J. Mullender, et al. The Amoeba distributed operating system: selected papers 1984-1987, 1987.

[8]  Michael J. Fischer, et al. The Consensus Problem in Unreliable Distributed Systems (A Brief Survey), 1983, FCT.

[9]  J. Goldberg, et al. SIFT: Design and analysis of a fault-tolerant computer for aircraft control, 1978, Proceedings of the IEEE.

[10]  Nancy A. Lynch, et al. Bounds on the time to reach agreement in the presence of timing uncertainty, 1991, STOC '91.

[11]  Danny Dolev, et al. Cheating husbands and other stories: A case study of knowledge, action, and communication, 1986, Distributed Computing.

[12]  Nancy A. Lynch, et al. Reaching approximate agreement in the presence of faults, 1986, JACM.

[13]  Leslie Lamport, et al. The Implementation of Reliable Distributed Multiprocess Systems, 1978, Comput. Networks.

[14]  Rachid Guerraoui, et al. Mutual exclusion in asynchronous systems with failure detectors, 2005, J. Parallel Distributed Comput.

[15]  Kenneth P. Birman, et al. Reliable communication in the presence of failures, 1987, TOCS.

[16]  D. McCue, et al. Fault-Tolerance in the Advanced Automation System, 1991, OPSR.

[17]  Danny Dolev, et al. On the minimal synchronism needed for distributed consensus, 1983, 24th Annual Symposium on Foundations of Computer Science (sfcs 1983).

[18]  Hector Garcia-Molina, et al. Reliable scheduling in a TMR database system, 1989, TOCS.

[19]  Sam Toueg, et al. Unreliable failure detectors for reliable distributed systems, 1996, JACM.

[20]  Leslie Lamport, et al. The Byzantine Generals Problem, 1982, TOPL.

[21]  Ronald J. Watro, et al. Fault-tolerant decision making in totally asynchronous distributed systems, 1987, ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing.

[22]  Shmuel Zaks, et al. A combinatorial characterization of the distributed tasks which are solvable in the presence of one faulty processor, 1988, PODC '88.

[23]  Sam Toueg, et al. Time and Message Efficient Reliable Broadcasts, 1990, WDAG.

[24]  Leslie Lamport, et al. Reaching Agreement in the Presence of Faults, 1980, JACM.

[25]  Flaviu Cristian, et al. Atomic Broadcast: From Simple Message Diffusion to Byzantine Agreement, 1995, Inf. Comput.

[26]  Hagit Attiya, et al. Achievable cases in an asynchronous environment, 1987, 28th Annual Symposium on Foundations of Computer Science (sfcs 1987).

[27]  Vassos Hadzilacos, et al. Issues of fault tolerance in concurrent computations (databases, reliability, transactions, agreement protocols, distributed computing), 1985.

[28]  I. Bey, et al. Delta-4: A Generic Architecture for Dependable Distributed Computing, 1991, Research Reports ESPRIT.

[29]  Yair Amir, et al. Transis: A Communication Sub-system for High Availability, 1992.

[30]  Rüdiger Reischuk, et al. A New Solution for the Byzantine Generals Problem, 1985, Inf. Control.

[31]  Fred B. Schneider, et al. Implementing fault-tolerant services using the state machine approach: a tutorial, 1990, CSUR.

[32]  Piotr Berman, et al. Towards optimal distributed consensus, 1989, 30th Annual Symposium on Foundations of Computer Science.

[33]  Richard D. Schlichting, et al. Preserving and using context information in interprocess communication, 1989, TOCS.