Paper information - Unreliable failure detectors for asynchronous distributed systems

Distributed computing is very attractive, but it comes with new problems: information loss, overflow, and breakdowns. Most often, these problems are neglected. Yet it has been shown that Consensus (a fundamental problem that requires the processes to agree on a common value) is unsolvable in a realistic computing model, i.e. a completely asynchronous one with possible crash failures [FLP85]. Intuitively, in an asynchronous environment, a process cannot determine whether a component has crashed or is merely very slow. Several approaches were designed to “bypass” this impossibility. One of them is self-stabilization, studied at LaRIA, which deals with transient faults. The principle is to design algorithms that can be executed from any initial state and eventually work according to their specification. Snap-stabilization is stronger: from any initial state, the algorithm always behaves according to its specification. The first snap-stabilizing algorithms were designed at LaRIA. Another approach, which we are going to study, copes with permanent (crash) failures. Ideally, a black box would be attached to each process to indicate precisely which failures have occurred in the network. This black box is called a failure detector. However, the result of [FLP85] implies that it is impossible to implement such a perfect failure detector. That is why Chandra and Toueg introduce in [CHT96] the notion of unreliable failure detectors. Even though such detectors are still impossible to implement, in practice this approach makes it possible to implement semi-algorithms. In theory, this approach also makes it possible to introduce a hierarchy of the unreliable
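The intuition that a silent process may be either crashed or merely slow can be illustrated with a small timeout-based detector. The sketch below is not taken from the paper: the class `TimeoutFailureDetector`, its methods, the timeout values, and the use of explicit timestamps in place of a real clock are all assumptions made for illustration. It suspects any process whose last heartbeat is older than a per-process timeout, and doubles that timeout whenever a suspicion turns out to be wrong.

```python
class TimeoutFailureDetector:
    """Hypothetical heartbeat-based detector; its suspicions may be wrong."""

    def __init__(self, processes, initial_timeout=1.0):
        self.timeout = {p: initial_timeout for p in processes}
        self.last_seen = {p: 0.0 for p in processes}
        self.suspected = set()

    def heartbeat(self, pid, now):
        """Record a heartbeat from pid received at time now."""
        self.last_seen[pid] = now
        if pid in self.suspected:
            # False suspicion: pid was slow, not crashed. Trust it again
            # and wait longer for it in the future.
            self.suspected.discard(pid)
            self.timeout[pid] *= 2

    def check(self, now):
        """Update and return the set of currently suspected processes."""
        for pid, seen in self.last_seen.items():
            if now - seen > self.timeout[pid]:
                self.suspected.add(pid)
        return set(self.suspected)


fd = TimeoutFailureDetector(["p", "q"], initial_timeout=1.0)
fd.heartbeat("p", now=0.5)
fd.heartbeat("q", now=0.5)
early = fd.check(now=1.0)    # both processes were heard from recently
fd.heartbeat("p", now=2.0)
mid = fd.check(now=2.0)      # q has been silent for longer than its timeout
fd.heartbeat("q", now=2.5)   # q was only slow, so its timeout is doubled
late = fd.check(now=4.0)     # p is now silent: crashed or slow, we cannot tell
```

In this run the detector first suspects `q` wrongly and later suspects `p`, and nothing it observes distinguishes a crash of `p` from a long delay. That is the sense in which such a detector is unreliable.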

[1] Sam Toueg, et al. The weakest failure detector for solving consensus, 1992, PODC '92.

[2] Cynthia Dwork, et al. Randomization in Byzantine Agreement, 1989, Adv. Comput. Res.

[3] C. Mohan, et al. Method for distributed transaction commit and recovery using Byzantine Agreement within clusters of processors, 1983, PODC '83.

[4] Nancy A. Lynch, et al. Impossibility of distributed consensus with one faulty process, 1983, PODS '83.

[5] Sung-Hoon Park. The Weakest Failure Detector for Solving Election Problems in Asynchronous Distributed Systems, 2002, EurAsia-ICT.

[6] Kenneth P. Birman, et al. Using process groups to implement failure detection in asynchronous environments, 1991, PODC '91.

[7] Sape J. Mullender, et al. The Amoeba distributed operating system: selected papers 1984-1987, 1987.

[8] Michael J. Fischer, et al. The Consensus Problem in Unreliable Distributed Systems (A Brief Survey), 1983, FCT.

[9] J. Goldberg, et al. SIFT: Design and analysis of a fault-tolerant computer for aircraft control, 1978, Proceedings of the IEEE.

[10] Nancy A. Lynch, et al. Bounds on the time to reach agreement in the presence of timing uncertainty, 1991, STOC '91.

[11] Danny Dolev, et al. Cheating husbands and other stories: A case study of knowledge, action, and communication, 1986, Distributed Computing.

[12] Nancy A. Lynch, et al. Reaching approximate agreement in the presence of faults, 1986, JACM.

[13] Leslie Lamport, et al. The Implementation of Reliable Distributed Multiprocess Systems, 1978, Comput. Networks.

[14] Rachid Guerraoui, et al. Mutual exclusion in asynchronous systems with failure detectors, 2005, J. Parallel Distributed Comput.

[15] Kenneth P. Birman, et al. Reliable communication in the presence of failures, 1987, TOCS.

[16] D. McCue, et al. Fault-Tolerance in the Advanced Automation System, 1991, OPSR.

[17] Danny Dolev, et al. On the minimal synchronism needed for distributed consensus, 1983, 24th Annual Symposium on Foundations of Computer Science (sfcs 1983).

[18] Hector Garcia-Molina, et al. Reliable scheduling in a TMR database system, 1989, TOCS.

[19] Sam Toueg, et al. Unreliable failure detectors for reliable distributed systems, 1996, JACM.

[20] Leslie Lamport, et al. The Byzantine Generals Problem, 1982, TOPL.

[21] Ronald J. Watro, et al. Fault-tolerant decision making in totally asynchronous distributed systems, 1987, ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing.

[22] Shmuel Zaks, et al. A combinatorial characterization of the distributed tasks which are solvable in the presence of one faulty processor, 1988, PODC '88.

[23] Sam Toueg, et al. Time and Message Efficient Reliable Broadcasts, 1990, WDAG.

[24] Leslie Lamport, et al. Reaching Agreement in the Presence of Faults, 1980, JACM.

[25] Flaviu Cristian, et al. Atomic Broadcast: From Simple Message Diffusion to Byzantine Agreement, 1995, Inf. Comput.

[26] Hagit Attiya, et al. Achievable cases in an asynchronous environment, 1987, 28th Annual Symposium on Foundations of Computer Science (sfcs 1987).

[27] Vassos Hadzilacos, et al. Issues of fault tolerance in concurrent computations (databases, reliability, transactions, agreement protocols, distributed computing), 1985.

[28] I. Bey, et al. Delta-4: A Generic Architecture for Dependable Distributed Computing, 1991, Research Reports ESPRIT.

[29] Yair Amir, et al. Transis: A Communication Sub-system for High Availability, 1992.

[30] Rüdiger Reischuk, et al. A New Solution for the Byzantine Generals Problem, 1985, Inf. Control.

[31] Fred B. Schneider, et al. Implementing fault-tolerant services using the state machine approach: a tutorial, 1990, CSUR.

[32] Piotr Berman, et al. Towards optimal distributed consensus, 1989, 30th Annual Symposium on Foundations of Computer Science.

[33] Richard D. Schlichting, et al. Preserving and using context information in interprocess communication, 1989, TOCS.