Continual On-Line Diagnosis of Hybrid Faults

Accurate system-state determination is essential to ensuring system dependability. An imprecise state assessment can lead to catastrophic failure through optimistic diagnosis, or to underutilization of resources through pessimistic diagnosis. Dependability is usually achieved through a fault detection, isolation and reconfiguration (FDIR) paradigm, of which the diagnosis procedure is a primary component. Fault resolution in on-line diagnosis is key to providing an accurate system-state assessment. Most diagnostic strategies are based on limited fault models that adopt either an optimistic bias (all faults are stuck-at-X, s-a-X) or a pessimistic bias (all faults are Byzantine). Our Hybrid Fault-Effects Model (HFM) handles a continuum of fault types that are distinguished by their impact on system operations. While this approach has been shown to enhance system functionality and dependability, on-line FDIR is required to make the HFM practical. In this paper, we develop a methodology for using system-state information to provide continual on-line diagnosis and reconfiguration as an integral part of system operations. We present diagnosis algorithms applicable under the generalized HFM and introduce the notion of fault decay time. Our diagnosis approach is based primarily on monitoring the system’s message traffic. Unlike existing approaches, it requires no explicit test procedures.
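
As a rough illustration of diagnosis driven by message traffic, the sketch below classifies one sender's message by its observed effect on the receivers, then keeps a per-node penalty that decays during fault-free rounds. It is not the paper's algorithm. The benign/symmetric/asymmetric split follows the usual hybrid fault taxonomy; the names `classify`, `NodeRecord`, `WEIGHTS`, the `expected` reference value, and all weights and thresholds are hypothetical, and the decaying penalty is only one possible reading of "fault decay time".

```python
from enum import Enum


class Effect(Enum):
    NONE = "none"              # every receiver saw the expected value
    BENIGN = "benign"          # every receiver detected the fault
    SYMMETRIC = "symmetric"    # every receiver saw the same wrong value
    ASYMMETRIC = "asymmetric"  # receivers saw differing values


# Hypothetical marker for a message that is missing or detectably invalid.
MISSING = None


def classify(received, expected):
    """Classify one sender's message from the copies seen by each receiver.

    `received` holds one (hashable) value per receiver; `expected` is an
    assumed reference value, e.g. the result of a vote.
    """
    if all(v is MISSING for v in received):
        return Effect.BENIGN
    if len(set(received)) > 1:
        return Effect.ASYMMETRIC
    if received[0] != expected:
        return Effect.SYMMETRIC
    return Effect.NONE


# Assumed severity weights; harder-to-detect effects are penalized more.
WEIGHTS = {Effect.NONE: 0, Effect.BENIGN: 1,
           Effect.SYMMETRIC: 2, Effect.ASYMMETRIC: 4}


class NodeRecord:
    """Per-node penalty that grows on faults and decays on clean rounds."""

    def __init__(self, threshold=6, decay=1):
        self.penalty = 0
        self.threshold = threshold
        self.decay = decay

    def update(self, effect):
        """Record one round; return True if the node should be excluded."""
        if effect is Effect.NONE:
            self.penalty = max(0, self.penalty - self.decay)
        else:
            self.penalty += WEIGHTS[effect]
        return self.penalty >= self.threshold


if __name__ == "__main__":
    record = NodeRecord()
    rounds = [[5, 7, 5], [5, 5, 5], [7, 7, 7], [MISSING, MISSING, MISSING]]
    for copies in rounds:
        effect = classify(copies, expected=5)
        print(effect.value, record.update(effect))
```

Because the penalty decays, a transient fault is eventually forgotten, while a node that keeps misbehaving accumulates penalty faster than it decays and crosses the exclusion threshold.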
