Research on RTOS-Integrated TMR for Fault Tolerant Systems

Safety and availability are major concerns in many critical systems. This paper presents a fault-tolerant system that integrates triple modular redundancy (TMR) with a real-time operating system (RTOS). The system comprises three homogeneous microcomputers and exposes its fault-tolerance functions to applications through system APIs. Because fault tolerance is integrated into the RTOS, the system is more general-purpose, and programmers need not concern themselves with the details of the fault-tolerance mechanism. The system operates in normal mode and in degraded (duplex or even single-module) modes, and it tolerates both transient and permanent faults. It also supports fault tolerance for multitasking applications, and reconfiguration after a fault is transparent to applications. In addition, a novel seamless software upgrade method based on intelligent state-transition control is proposed.
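
The voting and degradation behavior described above can be illustrated with a minimal sketch. It is not the paper's implementation: the names `vote`, `TmrGroup`, `fault_limit`, and `exclude` are hypothetical. It also assumes that a module is treated as permanently faulty after a fixed number of consecutive disagreements, and that in duplex mode a mismatch can be detected but not attributed to either module without outside diagnosis.

```python
from collections import Counter
from enum import Enum


class Mode(Enum):
    TRIPLE = 3  # normal TMR operation
    DUPLEX = 2  # degraded: one module excluded
    SINGLE = 1  # degraded: two modules excluded


def vote(outputs):
    """Strict-majority vote over {module_id: value}.

    Returns (value, suspects). value is None when no strict majority
    exists; suspects is the set of module ids that disagreed.
    """
    if not outputs:
        raise ValueError("no healthy modules left")
    value, count = Counter(outputs.values()).most_common(1)[0]
    if 2 * count <= len(outputs):
        return None, set()  # mismatch detected but cannot be attributed
    return value, {m for m, v in outputs.items() if v != value}


class TmrGroup:
    """Hypothetical bookkeeping for mode degradation (an assumption)."""

    def __init__(self, module_ids, fault_limit=3):
        self.healthy = set(module_ids)
        self.strikes = {m: 0 for m in module_ids}
        self.fault_limit = fault_limit  # assumed threshold for "permanent"

    @property
    def mode(self):
        return Mode(len(self.healthy))

    def exclude(self, module_id):
        # Stand-in for an external diagnosis (e.g. a self-test) that
        # removes a module; needed to go from duplex to single mode.
        self.healthy.discard(module_id)

    def step(self, outputs):
        """Vote on one round of outputs and reconfigure if needed."""
        live = {m: v for m, v in outputs.items() if m in self.healthy}
        value, suspects = vote(live)
        if value is None:
            return None
        for m in self.healthy - suspects:
            self.strikes[m] = 0  # an agreeing round clears transient faults
        for m in suspects:
            self.strikes[m] += 1
            if self.strikes[m] >= self.fault_limit:
                self.healthy.discard(m)  # treat the fault as permanent
        return value
```

The caller only sees the voted value returned by `step`; which modules contributed to it is hidden, which mirrors the transparency to applications that the abstract claims.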
