Task-scheduling strategies for reliable TMR controllers using task grouping and assignment

Real-time computers are often used in embedded, life-critical applications where high reliability is essential. A common way to make such systems dependable is to run multiple copies of the same task on redundant processors and vote on the results. The most popular redundant structure is triple modular redundancy (TMR). The processors in such systems are subject not only to independently occurring permanent and transient faults but also to correlated transient faults, such as those induced by electromagnetic interference (EMI) from the operating environment. This paper proposes two new scheduling strategies for TMR computer-controllers, both of which tolerate correlated as well as independent faults. The strategies, TMR-R (TMR with rotated task group) and TMR-Q (TMR with quintuple computation), are developed using task grouping and assignment. To evaluate their reliability, a discrete-time Markov model for control systems is devised, and reliability equations for TMR-R and TMR-Q are derived from the state transitions across sampling intervals in that model. The reliability of the two strategies is demonstrated by numerical comparison with a conventional TMR. The proposed strategies are expected to be useful for control systems operating in harsh environments, such as the controllers of airplanes or nuclear power plants.
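As a point of reference for the conventional TMR baseline mentioned above, the following is a minimal sketch of majority voting over redundant task outputs, together with the textbook reliability of a plain TMR. It is not the TMR-R or TMR-Q scheme from the paper. The function names `majority_vote` and `tmr_reliability` are hypothetical, and the reliability formula assumes independent module faults and a perfect voter, which is exactly the assumption that correlated faults violate.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value produced by a strict majority of modules, or None.

    Hypothetical helper: each element of `outputs` is the result one
    redundant processor computed for the same task.
    """
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


def tmr_reliability(r: float) -> float:
    """Textbook reliability of a conventional TMR with module reliability r.

    Assumes independent module faults and a perfect voter: the system
    works when at least two of the three modules work, which gives
    r^3 + 3 r^2 (1 - r) = 3 r^2 - 2 r^3.
    """
    return 3 * r**2 - 2 * r**3


if __name__ == "__main__":
    print(majority_vote([42, 42, 17]))  # one faulty copy is outvoted
    print(majority_vote([42, 17, 99]))  # no majority, vote fails
    print(tmr_reliability(0.9))
```

A single faulty copy is masked by the vote, but if a correlated fault corrupts two copies in the same interval, the vote can fail or return a wrong value. That is the failure mode the abstract says the proposed strategies are designed to tolerate.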
