Automatically Increasing the Fault-Tolerance of Distributed Algorithms

Abstract: The design of fault-tolerant distributed systems is a costly and difficult task, and its cost and difficulty increase dramatically with the severity of the failures that a system must tolerate. The task can be simplified by methods that automatically translate protocols tolerant of “benign” failures into ones tolerant of more “severe” failures. This paper describes two new translation mechanisms for synchronous systems: one translates protocols tolerant of crash failures into protocols tolerant of general omission failures, and the other translates protocols tolerant of general omission failures into protocols tolerant of arbitrary failures. Together, these can be used to translate any protocol tolerant of the most benign failures into a protocol tolerant of the most severe. The paper also shows lower bounds on the fault-tolerance of translations between certain systems. Some of the translations given match these lower bounds and are thus optimal with respect to fault-tolerance.
