A Primer on Architectural Level Fault Tolerance

This paper introduces the fundamental concepts of fault-tolerant computing. Key topics covered are voting, fault detection, clock synchronization, Byzantine Agreement, diagnosis, and reliability analysis. Low-level mechanisms such as Hamming codes and low-level communication protocols are not covered. The paper is tutorial in nature: it treats no topic in depth and focuses on rationale and approach rather than detailed exposition.
