On the reliability of consensus-based fault-tolerant distributed computing systems

The designer of a fault-tolerant distributed system faces numerous alternatives. Using a stochastic model of processor failure times, we investigate design choices such as replication level, protocol running time, randomized versus deterministic protocols, fault detection, and authentication. We use the probability with which a system produces the correct output as our evaluation criterion. This contrasts with previous fault-tolerance results that guarantee correctness only if the percentage of faulty processors in the system can be bounded. Our results reveal some subtle and counterintuitive interactions between the design parameters and system reliability.
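
As an illustration of how replication level and protocol running time can interact with reliability, the following is a minimal sketch and not the paper's own model. It assumes processors fail independently with exponentially distributed failure times, that a processor counts as faulty if it fails at any point during the protocol run, and that the system produces the correct output exactly when at most t of its n processors are faulty. The names failure_prob, reliability, and byzantine_tolerance are hypothetical, as are the rate and duration values; the bound t = floor((n - 1) / 3) is the standard resilience limit for Byzantine agreement without authentication.

```python
from math import comb, exp


def failure_prob(rate: float, duration: float) -> float:
    """Probability that one processor fails within `duration`,
    assuming an exponential failure-time distribution (an assumption)."""
    return 1.0 - exp(-rate * duration)


def reliability(n: int, t: int, rate: float, duration: float) -> float:
    """Probability that at most `t` of `n` independent processors fail
    during the protocol run, i.e. that the output is correct under
    the assumed fault threshold."""
    p = failure_prob(rate, duration)
    return sum(comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(t + 1))


def byzantine_tolerance(n: int) -> int:
    """Largest number of faults tolerated without authentication: floor((n - 1) / 3)."""
    return (n - 1) // 3


if __name__ == "__main__":
    rate, duration = 0.01, 10.0  # hypothetical failure rate and running time
    for n in (1, 3, 4, 7, 10):
        t = byzantine_tolerance(n)
        print(f"n={n:2d} t={t} reliability={reliability(n, t, rate, duration):.4f}")
```

Under these assumptions, going from one processor to three lowers reliability, because three processors tolerate no faults while offering three chances for one to occur; a fourth processor is needed before replication pays off. A longer running time raises the per-processor failure probability and shifts every figure downward.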
