Adaptive redundancy for fault-tolerant real-time systems

Reliability is an important aspect of real-time systems because the result of a real-time application is valid only if the application functions correctly and its timing constraints are satisfied. Faults are of two kinds: hardware faults and software faults. In this paper, we consider transient hardware faults. Full replication, or full hardware redundancy, can achieve a high degree of reliability, but it may waste resources. We propose a fault-tolerance approach, a hybrid of rollback and replication, for real-time systems that require both system reliability and guaranteed deadlines. We define a task as fault-tolerant if it can recover from a transient error by either rollback or duplication. Our approach attempts to make as many tasks fault-tolerant as possible.
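The definition of a fault-tolerant task can be illustrated with a small sketch. It is not the algorithm of the paper: the `Task` fields, the slack test for rollback, the single `spare_capacity` budget for duplicates, and the smallest-first greedy order are all assumptions made for illustration. A task is marked for rollback when its slack allows one full re-execution before the deadline; otherwise it is duplicated if enough spare capacity remains.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    wcet: float      # worst-case execution time (hypothetical field)
    deadline: float  # relative deadline, measured from task start (hypothetical field)


def plan_fault_tolerance(tasks, spare_capacity):
    """Label each task 'rollback', 'duplication', or 'unprotected'.

    Assumed model: rollback means re-executing the whole task once, so it
    is feasible when slack (deadline - wcet) is at least wcet. Duplication
    consumes wcet units of a shared spare-capacity budget.
    """
    plan = {}
    needs_duplicate = []
    for task in tasks:
        slack = task.deadline - task.wcet
        if slack >= task.wcet:
            plan[task.name] = "rollback"
        else:
            needs_duplicate.append(task)

    # Duplicating the cheapest tasks first protects the largest number of
    # tasks under a fixed budget.
    for task in sorted(needs_duplicate, key=lambda t: t.wcet):
        if task.wcet <= spare_capacity:
            plan[task.name] = "duplication"
            spare_capacity -= task.wcet
        else:
            plan[task.name] = "unprotected"
    return plan


tasks = [Task("a", 2, 5), Task("b", 3, 4), Task("c", 1, 1.5), Task("d", 4, 5)]
plan = plan_fault_tolerance(tasks, spare_capacity=4)
```

With these made-up numbers, task `a` has enough slack for rollback, `b` and `c` receive duplicates, and `d` stays unprotected because the spare budget is exhausted.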
