Dealing with dormant faults in an embedded fault-tolerant computer system

Accumulation of dormant faults is a potential threat in a fault-tolerant system, especially because fault tolerance is most often based on the single-fault assumption. We investigate this threat using the example of an automotive steer-by-wire application based on the Time-Triggered Architecture (TTA). Using a Markov model, we illustrate that fault dormancy can degrade the mean time to failure (MTTF) of a system by several orders of magnitude. We study potential remedies: transparent online testing proves to be the most powerful, while temporarily taking a hot spare offline to test it is a more feasible solution, though one with tight constraints on the test duration.
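The effect of dormancy on MTTF can be illustrated with a small continuous-time Markov chain. The sketch below is not the model used in this work. It is a hypothetical duplex system (one active unit, one hot spare) with three transient states and an absorbing failure state, and all names and rates (`lam`, `rho`, `delta`) are assumptions chosen only for illustration. A fault in the spare stays dormant until a test reveals it at rate `delta`; setting `delta = 0` models a spare that is never tested.

```python
# Hypothetical duplex model (an assumption, not the paper's model):
#   state 0: both units fault-free
#   state 1: one unit known to be faulty and under repair (simplex operation)
#   state 2: spare carries a dormant (undetected) fault
#   state F: system failure (absorbing, not represented explicitly)
# Assumed rates, all per hour:
#   lam   = failure rate of a single unit
#   rho   = repair rate of a unit known to be faulty
#   delta = rate at which testing reveals a dormant fault in the spare


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def mttf(lam, rho, delta):
    """Mean time to failure when starting in state 0.

    For each transient state i, the expected time to absorption t_i satisfies
    (total exit rate of i) * t_i - sum_j rate(i -> j) * t_j = 1.
    """
    a = [
        [2 * lam, -lam, -lam],         # 0 -> 1 (active fails), 0 -> 2 (spare fails silently)
        [-rho, rho + lam, 0.0],        # 1 -> 0 (repair), 1 -> F (last unit fails)
        [0.0, -delta, delta + lam],    # 2 -> 1 (test finds fault), 2 -> F (active fails)
    ]
    return solve(a, [1.0, 1.0, 1.0])[0]


if __name__ == "__main__":
    lam, rho = 1e-4, 1.0  # illustrative values only
    for delta in (0.0, 1e-3, 1e-1, 1.0):
        print(f"delta = {delta:g}/h -> MTTF = {mttf(lam, rho, delta):.3e} h")
```

With these assumed rates, the untested spare (`delta = 0`) yields an MTTF close to that of a system without effective redundancy, while frequent testing raises it by orders of magnitude. The actual figures for the steer-by-wire system depend on the model and parameters of the study itself.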
