Graceful degradation in algorithm-based fault tolerant multiprocessor systems

Algorithm-based fault tolerance (ABFT) is a technique for improving the reliability of a multiprocessor system by providing it with concurrent error detection and fault location capability. In this paper, we propose the first integrated solution to the problem of fault detection, location, and graceful degradation in ABFT systems. Unlike most previous methods, we use an extended model for representing ABFT systems, which allows faults to occur in check-computing processors.
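To make "concurrent error detection and fault location" concrete, the following is a minimal sketch of the classic checksum encoding for matrix multiplication, a well-known ABFT example. It is not the method proposed in this paper. The function names `checksum_matmul` and `locate_error` are hypothetical, integer data is assumed so that checksums compare exactly, and the sketch assumes the checksum entries themselves are fault-free, which is the assumption the extended model described above relaxes.

```python
def checksum_matmul(a, b):
    """Multiply a (n x m) by b (m x p) after appending checksums.

    A row of column sums is appended to a, and a column of row sums
    is appended to b, so the product is a full-checksum matrix of
    size (n + 1) x (p + 1).
    """
    a_c = a + [[sum(col) for col in zip(*a)]]   # column-checksum matrix
    b_r = [row + [sum(row)] for row in b]       # row-checksum matrix
    return [
        [sum(x * y for x, y in zip(row, col)) for col in zip(*b_r)]
        for row in a_c
    ]


def locate_error(c_full):
    """Return None if all checks pass, else (row, col) of a single bad data entry.

    Assumption (illustrative only): at most one data element is wrong
    and the checksum row and column are themselves correct.
    """
    n = len(c_full) - 1      # number of data rows
    p = len(c_full[0]) - 1   # number of data columns
    bad_rows = [i for i in range(n) if sum(c_full[i][:p]) != c_full[i][p]]
    bad_cols = [
        j for j in range(p)
        if sum(c_full[i][j] for i in range(n)) != c_full[n][j]
    ]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return (bad_rows[0], bad_cols[0])
    raise ValueError("error pattern is not locatable under the single-error assumption")


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c_full = checksum_matmul(a, b)
clean = locate_error(c_full)      # all checks pass

c_full[1][0] += 5                 # inject a single erroneous result
faulty = locate_error(c_full)     # intersection of the failing row and column
```

In this sketch a single corrupted data element violates exactly one row check and one column check, and their intersection locates it. A corrupted checksum entry violates only one of the two and is reported as not locatable, which illustrates why a model that admits faults in check-computing processors needs more than this simple scheme.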
