Software fault tolerance in real-time systems

Abstract: The paper proposes a technique for providing software fault tolerance in real-time applications that demand fast response and a high degree of reliability. It is shown that reasonably flexible interprocess communication can be supported with only a small increase in complexity and overhead. The two most prominent features of the proposed scheme are that (1) it exploits fault-avoidance techniques as much as possible to reduce the overhead of fault tolerance and (2) it controls the propagation of errors so as to enable efficient recovery. Formal proofs of the system operation are developed. Besides showing that the scheme works as expected, the proofs highlight the assumptions needed for provably correct operation. Some issues relating to hardware fault tolerance are also considered.
