Diagnosability and diagnosis of algorithm-based fault tolerant systems

Multiprocessor architectures are now in common use for signal processing and other computation-intensive applications. These applications are characterized by high-speed data processing, long processing periods, or both. It is therefore desirable that erroneous data produced by the system be detected as quickly as possible, and that the faulty processors producing it be located and reconfigured out of the system. Algorithm-based fault tolerance (ABFT) is a low-cost, system-level concurrent error detection scheme that can also be used to locate faulty processors. Graph-theoretic and matrix-based models have been developed for analyzing the fault diagnosability of systems that use ABFT. Here, methods from system-level diagnosis of multiprocessor systems are applied to the analysis of ABFT systems. Using these methods, an improved diagnosability algorithm is provided. An efficient diagnosis algorithm is also given that identifies the faulty processors in an ABFT system, if any exist, from the available information; no such algorithm was previously known.
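To make the idea of concurrent error detection and location concrete, the following is a minimal sketch of the well-known checksum scheme for matrix multiplication, a standard ABFT example from the literature. It is not the diagnosability or diagnosis algorithm of this paper. The function names are hypothetical, and the sketch assumes integer data (so exact equality can be used) and a single erroneous result element.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def column_checksum(a):
    """Append a row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def row_checksum(b):
    """Append to each row of b the sum of that row."""
    return [row + [sum(row)] for row in b]


def locate_errors(c_full):
    """Return (bad_rows, bad_cols) whose checksums are inconsistent.

    c_full is the full-checksum product: its last row and last column
    hold the checksums of the data part.
    """
    n = len(c_full) - 1      # number of data rows
    m = len(c_full[0]) - 1   # number of data columns
    bad_rows = [i for i in range(n)
                if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    return bad_rows, bad_cols


# Illustrative data (assumed, not from the paper).
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C_full = matmul(column_checksum(A), row_checksum(B))

clean = locate_errors(C_full)      # no inconsistent checksums

C_full[1][0] += 5                  # inject a single erroneous element
faulty = locate_errors(C_full)     # row 1 and column 0 are flagged
```

The intersection of a flagged row and a flagged column points at the erroneous element. If each result element is attributed to one processor (an assumption of this sketch), that also locates the suspect processor. With floating-point data the equality tests would need a tolerance to absorb round-off.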
