Optimal discrimination between transient and permanent faults

An important practical problem in fault diagnosis is discriminating between permanent faults and transient faults. In many computer systems, the majority of errors are due to transient faults. Many heuristic methods have been used for discriminating between transient and permanent faults; however, we have found no previous work stating this decision problem in clear probabilistic terms. We present an optimal procedure for discriminating between transient and permanent faults, based on applying Bayesian inference to the observed events (correct and erroneous results). We describe how the assessed probability that a module is permanently faulty must vary with observed symptoms. We describe and demonstrate our proposed method on a simple application problem, building the appropriate equations and showing numerical examples. The method can be implemented as a run-time diagnosis algorithm at little computational cost; it can also be used to evaluate any heuristic diagnostic procedure by comparison.
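
The Bayesian update described above can be illustrated with a minimal sketch. It is not the paper's model: it assumes a simplified setting in which a module is either permanently faulty from the start or fault-free, results are conditionally independent given that state, a permanently faulty module gives an erroneous result with probability `q_perm`, and a fault-free module gives one (through a transient fault) with probability `q_trans`. The function names, parameter names, and default values are all hypothetical.

```python
def update_posterior(prior, erroneous, q_perm=0.9, q_trans=0.01):
    """Apply Bayes' rule once to one observed result.

    prior     -- current probability that the module is permanently faulty
    erroneous -- True if the observed result was erroneous, False if correct
    q_perm    -- assumed P(erroneous result | permanently faulty)
    q_trans   -- assumed P(erroneous result | not permanently faulty),
                 i.e. the chance of a transient fault hitting one result
    """
    # Likelihood of this observation under each hypothesis.
    like_perm = q_perm if erroneous else 1.0 - q_perm
    like_ok = q_trans if erroneous else 1.0 - q_trans
    numerator = like_perm * prior
    return numerator / (numerator + like_ok * (1.0 - prior))


def posterior_trace(observations, prior=0.001, q_perm=0.9, q_trans=0.01):
    """Return the posterior after each observation in sequence."""
    trace = []
    p = prior
    for erroneous in observations:
        p = update_posterior(p, erroneous, q_perm, q_trans)
        trace.append(p)
    return trace


if __name__ == "__main__":
    # Hypothetical symptom sequence: one isolated error, then a burst.
    symptoms = [False, True, False, False, True, True, True]
    for step, p in enumerate(posterior_trace(symptoms), start=1):
        print(f"after observation {step}: P(permanent) = {p:.6f}")
```

Under these assumptions each erroneous result raises the assessed probability of a permanent fault and each correct result lowers it. A run-time diagnoser could compare that probability with a threshold chosen from the relative costs of wrongly removing a healthy module and wrongly keeping a faulty one.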
