A simple and fast asynchronous consensus protocol based on a weak failure detector

Summary. The Consensus problem is a fundamental paradigm for fault-tolerant asynchronous systems. It abstracts a family of problems known as Agreement (or Coordination) problems, so any solution to consensus can serve as a basic building block for solving them (e.g., atomic commitment or atomic broadcast). Solving consensus in an asynchronous system is not a trivial task: Fischer, Lynch and Paterson proved (1985) that no deterministic solution exists in an asynchronous system subject to even a single crash failure. To circumvent this impossibility result, Chandra and Toueg introduced the concept of unreliable failure detectors (1991) and studied how these failure detectors can be used to solve consensus in asynchronous systems with crash failures. This paper presents a new consensus protocol that uses a failure detector of the class $\Diamond{\cal S}$. Like previous protocols, it is based on the rotating coordinator paradigm and proceeds in asynchronous rounds. Simplicity and efficiency are its main characteristics. From a performance point of view, the protocol is particularly efficient when the underlying failure detector makes no mistake, whether or not failures occur (a common case in practice). From a design point of view, the protocol combines three simple mechanisms: a voting mechanism, a small finite state automaton that manages the behavior of each process, and the possibility for a process to change its mind during a round.
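The summary does not spell out the protocol itself, so the following is only a minimal sketch of the rotating coordinator paradigm it mentions, not the paper's algorithm. It assumes processes numbered 0..n-1, rounds numbered from 1, and a strict-majority rule for the voting step; the function names (`coordinator`, `has_majority`, `first_trusted_round`) and the `suspected` set standing in for a failure detector's output are hypothetical.

```python
from typing import Optional, Set


def coordinator(round_no: int, n: int) -> int:
    """Coordinator of a round under rotation (assumed: rounds start at 1)."""
    return (round_no - 1) % n


def has_majority(votes: int, n: int) -> bool:
    """True when `votes` is a strict majority of the n processes (assumed rule)."""
    return votes > n // 2


def first_trusted_round(n: int, suspected: Set[int], start: int = 1) -> Optional[int]:
    """Smallest round >= start whose coordinator is not currently suspected.

    `suspected` is a hypothetical snapshot of a failure detector's output.
    Returns None when every process is suspected.
    """
    # n consecutive rounds visit every process once as coordinator.
    for r in range(start, start + n):
        if coordinator(r, n) not in suspected:
            return r
    return None


if __name__ == "__main__":
    n = 3
    print([coordinator(r, n) for r in range(1, 7)])
    print(has_majority(2, n))
    print(first_trusted_round(n, suspected={0}))
```

Because the coordinator role rotates through all processes, a round coordinated by a correct, unsuspected process is eventually reached once the failure detector stops suspecting it; when the detector makes no mistake and the first coordinator is correct, that is the first round.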

[1] Kenneth P. Birman, et al. Reliable communication in the presence of failures, 1987, TOCS.

[2] Marcos K. Aguilera, et al. Randomization and Failure Detection: A Hybrid Approach to Solve Consensus, 1996, WDAG.

[3] André Schiper. Early consensus in an asynchronous system with a weak failure detector, 1997, Distributed Computing.

[4] Sam Toueg, et al. The weakest failure detector for solving consensus, 1992, PODC '92.

[5] Nancy A. Lynch, et al. Impossibility of distributed consensus with one faulty process, 1985, JACM.

[6] Sam Toueg, et al. Unreliable failure detectors for reliable distributed systems, 1996, JACM.

[7] Dale Skeen, et al. Nonblocking commit protocols, 1981, SIGMOD '81.

[8] André Schiper, et al. Consensus: The Big Misunderstanding, 1997.

[9] Michael K. Reiter, et al. Unreliable intrusion detection in distributed computations, 1997, Proceedings 10th Computer Security Foundations Workshop.

[10] Sam Toueg, et al. Unreliable failure detectors for asynchronous systems (preliminary version), 1991, PODC '91.

[11] Jim Gray, et al. Notes on Data Base Operating Systems, 1978, Advanced Course: Operating Systems.

[12] Danny Dolev, et al. On the minimal synchronism needed for distributed consensus, 1983, 24th Annual Symposium on Foundations of Computer Science (sfcs 1983).

[13] Rachid Guerraoui. Revisiting the Relationship Between Non-Blocking Atomic Commitment and Consensus, 1995, WDAG.

[14] Danny Dolev, et al. Early stopping in Byzantine agreement, 1990, JACM.

[15] Nancy A. Lynch, et al. Consensus in the presence of partial synchrony, 1988, JACM.

[16] Nancy A. Lynch, et al. Distributed Algorithms, 1992, Lecture Notes in Computer Science.