Threshold-Based Mechanisms to Discriminate Transient from Intermittent Faults

This paper presents a class of count-and-threshold mechanisms, collectively named α-count, which can discriminate between transient and intermittent faults in computing systems. Commercial systems have used threshold-based techniques to discriminate transient faults for many years. We aim to make count-and-threshold schemes more useful by exploring their effects on the system. We adopt a mathematically defined structure that is simple enough to analyze with standard tools. α-count has internal parameters that can be tuned to suit environmental variables (such as the transient fault rate and intermittent fault occurrence patterns). We carried out an extensive behavior analysis of two versions of the count-and-threshold scheme, assuming first exponentially distributed fault occurrences and then more realistic fault patterns.
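
The abstract does not spell out the update rule, so the following is only a minimal sketch of one plausible count-and-threshold scheme: a per-component score grows by 1 on each erroneous judgment, decays geometrically on each correct one, and the component is flagged as affected by a non-transient fault once the score reaches a threshold. The names `AlphaCount`, `decay`, `threshold`, and `first_flag`, and the numeric values, are assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class AlphaCount:
    """Illustrative count-and-threshold filter (parameters are hypothetical)."""
    decay: float = 0.5       # assumed decay factor, 0 <= decay < 1
    threshold: float = 2.5   # assumed threshold on the score
    score: float = 0.0

    def update(self, error: bool) -> bool:
        """Feed one judgment; return True if the score has reached the threshold."""
        if error:
            self.score += 1.0         # an error raises the score
        else:
            self.score *= self.decay  # a correct result lets old errors fade
        return self.score >= self.threshold


def first_flag(signals: Iterable[int]) -> Optional[int]:
    """Return the index of the first judgment at which the component is flagged."""
    counter = AlphaCount()
    for i, s in enumerate(signals):
        if counter.update(bool(s)):
            return i
    return None


# Isolated errors (transient-like): the score decays between errors.
sparse = [1, 0, 0, 0, 1, 0, 0, 0]
# Clustered errors (intermittent-like): the score accumulates.
burst = [1, 1, 0, 1, 1]

print(first_flag(sparse))  # None
print(first_flag(burst))   # 4
```

In this sketch a smaller `decay` forgets past errors faster and a larger `threshold` delays flagging, which is the kind of trade-off the tunable internal parameters are meant to capture.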
