On the necessity of on-line BIST in safety-critical applications: a case study

This paper analyzes the effect of dormant faults on the mean time to failure (MTTF) of highly reliable systems. The analysis uses Markov models that quantify the effect of dormant faults and of other vital reliability parameters. It turns out that dormant faults can drastically reduce the MTTF of a system, particularly if the operating system allows a sporadic ("event-driven") change from the regular mode of operation to another mode. Virtually every practical system involves such a change, at least in case of an emergency. On-line built-in self-test (BIST) is shown to be an effective means of overcoming the detrimental effect of dormant faults and re-establishing a high MTTF; even a very moderate test period may be sufficient. The analysis is performed for the example of a fail-silent communication system for safety-critical real-time applications.
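
The abstract does not reproduce the paper's Markov models, so the following is only a minimal sketch of the kind of calculation it describes, under assumptions that are not taken from the paper. It uses a hypothetical three-state chain (fault-free, dormant fault present, failed). A dormant fault arises at rate `lambda_dormant` and becomes fatal only when a sporadic mode change occurs at rate `event_rate`; `lambda_active` covers all immediately fatal faults. On-line BIST with period `test_period` is approximated by an exponential detection rate of `1 / test_period`, and a detected fault is assumed to be repaired at once. All names and numbers are illustrative.

```python
def mttf_from_fault_free(lambda_dormant, lambda_active, event_rate, test_period=None):
    """Mean time to failure of a hypothetical 3-state Markov chain, starting fault-free.

    States (assumed for illustration, not the paper's model):
      OK      - no fault present
      DORMANT - a latent fault sits in a function used only after a mode change
      FAILED  - absorbing failure state

    Transitions:
      OK      -> DORMANT at lambda_dormant
      OK      -> FAILED  at lambda_active
      DORMANT -> FAILED  at event_rate + lambda_active
      DORMANT -> OK      at 1 / test_period (BIST detects the fault; repair assumed instant)
    """
    if min(lambda_dormant, lambda_active, event_rate) < 0:
        raise ValueError("rates must be non-negative")
    if test_period is not None and test_period <= 0:
        raise ValueError("test_period must be positive")

    # Without BIST there is no way back from DORMANT to OK.
    test_rate = 0.0 if test_period is None else 1.0 / test_period

    # Total exit rates of the two transient states.
    exit_ok = lambda_dormant + lambda_active
    exit_dormant = event_rate + lambda_active + test_rate

    # Mean times to absorption m_ok, m_dormant satisfy
    #   exit_ok * m_ok - lambda_dormant * m_dormant = 1
    #  -test_rate * m_ok + exit_dormant * m_dormant = 1
    # which is solved here with Cramer's rule.
    det = exit_ok * exit_dormant - lambda_dormant * test_rate
    if det <= 0:
        raise ValueError("failure state is unreachable; MTTF is infinite")
    return (exit_dormant + lambda_dormant) / det


if __name__ == "__main__":
    # Illustrative rates per hour; these values are assumptions, not data from the paper.
    rates = dict(lambda_dormant=1e-5, lambda_active=1e-7, event_rate=1e-3)
    print("MTTF without BIST [h]:", mttf_from_fault_free(**rates))
    print("MTTF with daily BIST [h]:", mttf_from_fault_free(**rates, test_period=24.0))
```

In this simplified chain, the MTTF without BIST reduces to roughly the mean time until a dormant fault appears plus the mean time until the next mode change, whereas periodic testing repeatedly returns the system to the fault-free state and moves the MTTF toward the limit set by `lambda_active` alone.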
