Paper information - Consensus in asynchronous systems where processes can crash and recover

The consensus problem is now recognized as one of the most important problems encountered in the design and construction of fault-tolerant distributed systems. The problem is defined as follows: processes have to reach a common decision, which depends on their inputs, despite failures. We consider the consensus problem in asynchronous distributed systems augmented with unreliable failure detectors. Several protocols have been proposed for these systems under the assumption that process crashes are definitive. This paper addresses the consensus problem in a more practical asynchronous system model, namely one in which processes can crash and recover. Because a process crash entails the loss of its volatile memory, each process is equipped with stable storage. To be efficient, a consensus protocol therefore has to log as little critical data as possible. The proposed protocol uses a new class of failure detectors suited to the crash/recovery model. It is particularly efficient when the underlying failure detector makes few mistakes, whether or not crashes occur. Additionally, the proposed protocol tolerates message duplication and copes with some message losses.

Achour Mostéfaoui | Michel Raynal | Michel Hurfin

[1] Roy Friedman, et al. Failure detectors in omission failure environments, 1997, PODC '97.

[2] Nancy A. Lynch, et al. Impossibility of distributed consensus with one faulty process, 1985, JACM.

[3] Michel Hurfin, et al. Fast Asynchronous Consensus Based on a Weak Failure Detector, 1997.

[4] Sam Toueg, et al. Unreliable failure detectors for reliable distributed systems, 1996, JACM.

[5] André Schiper, et al. Stubborn Communication Channels, 1998.

[6] Marcos K. Aguilera, et al. Heartbeat: A Timeout-Free Failure Detector for Quiescent Reliable Communication, 1997, WDAG.

[7] Marcos K. Aguilera, et al. Randomization and Failure Detection: A Hybrid Approach to Solve Consensus, 1996, WDAG.

[8] André Schiper. Early consensus in an asynchronous system with a weak failure detector, 1997, Distributed Computing.

[9] André Schiper, et al. Consensus in the Crash-Recover Model, 1997.

[10] Rachid Guerraoui. Revisiting the Relationship Between Non-Blocking Atomic Commitment and Consensus, 1995, WDAG.

[11] Marcos K. Aguilera, et al. Failure detection and consensus in the crash-recovery model, 2000, Distributed Computing.

[12] Sam Toueg, et al. The weakest failure detector for solving consensus, 1992, PODC '92.