Graceful Degradation in Algorithm-Based Fault Tolerant Multiprocessor Systems

Algorithm-based fault tolerance (ABFT) is a technique that improves the reliability of a multiprocessor system by giving it concurrent error detection and fault location capability. It encodes data at the system level and modifies the algorithm to operate on the encoded data, so that both transient and permanent faults in any processor are exposed. Previous work in this area addresses only fault detection and location. However, if no spare processors are available, then once a faulty processor has been located, the work initially assigned to it must be remapped to nonfaulty processors in such a way that the system retains its fault tolerance capability with as little performance degradation as possible. In this paper, we propose an integrated deterministic solution to this problem that combines concurrent error detection and fault location with graceful degradation. No previous deterministic ABFT method exists for designing general t-fault locating systems, even for t = 1. We propose a general method for designing one-fault locating/s-fault detecting systems. We use an extended model for representing ABFT systems, in which the processors that compute the checks are treated as part of the ABFT system. Faults in the check-computing processors can therefore also be detected and located using a simple diagnosis algorithm, and the checks can be remapped to other nonfaulty processors in the system.
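To make the idea of "encoding data and operating on the encoded data" concrete, the following is a minimal sketch of the well-known checksum scheme for matrix multiplication, not the design method proposed in this paper. It assumes integer matrices, a single erroneous data element, and a single sequential process standing in for the multiprocessor system; the function names (`column_checksum`, `row_checksum`, `locate_error`) are illustrative and do not come from the paper.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def column_checksum(a):
    """Append a row holding the sum of each column of a."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def row_checksum(b):
    """Append to each row of b the sum of that row."""
    return [row + [sum(row)] for row in b]


def locate_error(c_full):
    """Check a full-checksum matrix.

    Returns None if every row and column checksum is consistent, or
    (row, col) of the single inconsistent data element. Raises ValueError
    when an error is detected but cannot be pinned to one data element
    (for example, when a checksum element itself is corrupted).
    """
    n = len(c_full) - 1        # number of data rows
    m = len(c_full[0]) - 1     # number of data columns
    bad_rows = [i for i in range(n)
                if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return (bad_rows[0], bad_cols[0])
    raise ValueError("error detected but not locatable to one data element")


# Encode the inputs, then run the ordinary algorithm on the encoded data.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C_full = matmul(column_checksum(A), row_checksum(B))

# Simulate a fault in the processor that produced element (1, 0).
C_faulty = [row[:] for row in C_full]
C_faulty[1][0] += 5
fault_position = locate_error(C_faulty)   # (1, 0)
```

The product of a column-checksum matrix and a row-checksum matrix is itself a full-checksum matrix, so a single wrong data element shows up as exactly one inconsistent row and one inconsistent column. This covers only detection and location; remapping the faulty processor's work while preserving the checks is the separate problem this paper addresses.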
