Atomic Broadcast: From Simple Message Diffusion to Byzantine Agreement

In distributed systems subject to random communication delays and component failures, atomic broadcast can be used to implement the abstraction of synchronous replicated storage: a distributed storage that displays the same contents at every correct processor as of any clock time. This paper presents a systematic derivation of a family of atomic broadcast protocols that tolerate increasingly general failure classes: omission failures, timing failures, and authentication-detectable Byzantine failures. The protocols work for arbitrary point-to-point network topologies and can tolerate any number of link and process failures, up to network partitioning. After proving the protocols correct, we prove two lower bounds showing that, in many cases, they provide the best possible termination times.
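
As a rough illustration of how timestamped message diffusion can yield identical replica contents, the following Python sketch simulates a failure-free run. It is not taken from the paper: the function name `simulate`, the fixed `link_delay`, the assumption of perfectly synchronized clocks, and the tie-break on sender name are all simplifying assumptions made here for illustration. Each update is stamped with its send time, flooded over the links, and applied everywhere in (timestamp, sender) order once `delta` time units have passed.

```python
import heapq
from collections import defaultdict

def simulate(links, broadcasts, link_delay, delta):
    """Toy diffusion-based broadcast (illustrative assumptions only).

    links:      dict mapping each node to a list of its neighbours
    broadcasts: list of (send_time, sender, payload) tuples
    link_delay: fixed delay of every link (assumed constant, no failures)
    delta:      time after its timestamp at which an update is applied
    Returns a dict mapping each node to the payloads in the order applied.
    """
    events = []  # heap of (arrival_time, seq, node, message)
    seq = 0      # unique counter so heap ordering never compares messages
    for t, sender, payload in broadcasts:
        heapq.heappush(events, (t, seq, sender, (t, sender, payload)))
        seq += 1

    seen = defaultdict(set)      # messages each node has already received
    pending = defaultdict(list)  # messages each node has accepted

    while events:
        now, _, node, msg = heapq.heappop(events)
        if msg in seen[node]:
            continue             # duplicate copy: ignore it
        seen[node].add(msg)
        if now > msg[0] + delta:
            continue             # arrived too late: discard, do not relay
        pending[node].append(msg)
        for neighbour in links[node]:
            heapq.heappush(events, (now + link_delay, seq, neighbour, msg))
            seq += 1

    # Every node applies accepted updates in (timestamp, sender) order.
    return {node: [m[2] for m in sorted(pending[node])] for node in links}

# Three processors on a line: a - b - c (network diameter 2).
links = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
broadcasts = [(0, "c", "x=1"), (0, "a", "x=2"), (1, "b", "x=3")]
orders = simulate(links, broadcasts, link_delay=1, delta=2)
```

In this run `delta` equals the network diameter times the link delay, so every update reaches every processor in time, and all three processors end up applying the updates in the same order. A real protocol must also account for clock deviation and for the failure classes named in the abstract, which this sketch does not model.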
