Encapsulating Failure Detection: From Crash to Byzantine Failures

Separating the different aspects of a program and encapsulating them inside well-defined modules is considered good engineering discipline. This discipline is particularly desirable in the development of distributed agreement algorithms, which are known to be difficult and error-prone. For such algorithms, one important aspect to encapsulate is failure detection. In fact, complete encapsulation has been proven feasible in distributed systems with process crash failures, by using black-box failure detectors. This paper discusses the feasibility of a similar encapsulation in the context of Byzantine (also called arbitrary or malicious) failures. We argue that, in the Byzantine context, it is impossible to achieve the level of encapsulation of the original crash failure detector model. However, we also argue that there is room for an intermediate approach in which algorithms that solve agreement problems, such as consensus and atomic broadcast, can still benefit from grey-box failure detectors that partially encapsulate Byzantine failure detection.
