Analysis and Randomized Design of Algorithm-Based Fault Tolerant Multiprocessor Systems Under an Extended Model

The reliability of compute-intensive applications can be improved by introducing fault tolerance into the system. Algorithm-based fault tolerance (ABFT) is a low-cost scheme that provides the required fault tolerance through system-level encoding. In this paper, we propose randomized construction techniques, under an extended model, for designing ABFT systems with the required fault tolerance capability. The extended model considers failures in the processors that perform the checking operations.
