Implementing Adaptive Fault-Tolerant Services for Hybrid Faults

The two major approaches to building fault-tolerant services are the primary-backup approach (PB) and the state-machine approach (SM). PB tolerates crash and omission faults and is cheaper to run than SM, whereas SM tolerates more severe faults, including arbitrary (Byzantine) faults. Committing to a single approach means either paying the high running cost of SM or, with PB, risking an incorrect service when unexpected faults occur. We instead advocate adaptive fault tolerance and present algorithms that adapt between PB and SM, thus retaining (almost) the best of both worlds. Our adaptive approach is modular, in that any PB or SM protocol can be used, and practical, in that it can be easily incorporated into some existing systems.
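To make the idea of adapting between PB and SM concrete, here is a minimal sketch. It is not the paper's algorithm. It assumes replicas are plain callables, that some external detector (not modeled here) reports a suspected non-crash fault, and that SM can be approximated by majority voting over replica replies. The names `AdaptiveService`, `Mode`, and `report_suspicion` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from enum import Enum


class Mode(Enum):
    PB = "primary-backup"
    SM = "state-machine"


class AdaptiveService:
    """Hypothetical wrapper: runs cheaply in PB mode, switches to SM on suspicion."""

    def __init__(self, replicas):
        if not replicas:
            raise ValueError("need at least one replica")
        # Assumption: each replica is a deterministic callable request -> reply.
        self.replicas = list(replicas)
        self.mode = Mode.PB

    def report_suspicion(self):
        # Assumption: an external detector calls this when it suspects a
        # fault more severe than crash or omission.
        self.mode = Mode.SM

    def handle(self, request):
        if self.mode is Mode.PB:
            # PB mode: only the primary (replica 0) executes the request.
            return self.replicas[0](request)
        # SM mode: every replica executes the request; return the majority reply.
        replies = [replica(request) for replica in self.replicas]
        reply, votes = Counter(replies).most_common(1)[0]
        if votes <= len(replies) // 2:
            raise RuntimeError("no majority among replica replies")
        return reply
```

In this sketch, a primary that returns wrong values goes unnoticed in PB mode; once `report_suspicion()` is called, the same request is answered by majority vote, at the cost of running every replica. Failover from a crashed primary, replica agreement on request order, and switching back to PB are deliberately left out.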
