Reliability of checkpointed real-time systems using time redundancy

Real-time computers are often used in embedded, life-critical applications where high reliability is important. A common approach to making such systems dependable is to vote on redundant processors executing multiple copies of the same task is described. The processors which make up such voted systems are subjected not only to independently occurring permanent and transient failure, but also to correlated transients brought about by electromagnetic interference from the operating environment. To counteract these transients, checkpointing and time redundancy are required, in addition to processor redundancy. This work analyzes the use of time and device redundancy in systems subject to correlated failure. The tradeoffs in checkpoint placement in such a system are found to be considerably different from those for non-redundant systems without real-time constraints. The authors compare fault-tolerant designs and without a rollback capability, accounting for the increased hardware-failure rate due to processor duplication when faults are detected in hardware, and the doubled execution times when detection is implemented in software. >

[1]  John W. Young,et al.  A first order approximation to the optimum checkpoint interval , 1974, CACM.

[2]  K. Mani Chandy,et al.  Analytic models for rollback and recovery strategies in data base systems , 1975, IEEE Transactions on Software Engineering.

[3]  K. Mani Chandy,et al.  A Survey of Analytic Models of Rollback and Recovery Stratergies , 1975, Computer.

[4]  A.L. Hopkins,et al.  FTMP—A highly reliable fault-tolerant multiprocess for aircraft , 1978, Proceedings of the IEEE.

[5]  J. Goldberg,et al.  SIFT: Design and analysis of a fault-tolerant computer for aircraft control , 1978, Proceedings of the IEEE.

[6]  Erol Gelenbe,et al.  On the Optimum Checkpoint Interval , 1979, JACM.

[7]  Kang G. Shin,et al.  Optimization criteria for checkpoint placement , 1984, CACM.

[8]  Kang G. Shin,et al.  A unified method for evaluating real-time computer controllers and its application , 1985 .

[9]  Jaynarayan H. Lala,et al.  Hardware and software fault tolerance: a unified architectural approach , 1988, [1988] The Eighteenth International Symposium on Fault-Tolerant Computing. Digest of Papers.

[10]  Chris J. Walter,et al.  The MAFT Architecture for Distributed Fault Tolerance , 1988, IEEE Trans. Computers.

[11]  Philip A. Bernstein,et al.  Sequoia: a fault-tolerant tightly coupled multiprocessor for transaction processing , 1988, Computer.

[12]  Jean Arlat,et al.  Definition and analysis of hardware- and software-fault-tolerant architectures , 1990, Computer.

[13]  Daniel P. Siewiorek Fault tolerance in commercial computers , 1990, Computer.