Algorithm-Based Fault Tolerance on a Hypercube Multiprocessor

The design of a fault-tolerant hypercube multiprocessor architecture is discussed. The authors propose detecting and locating faulty processors concurrently with the execution of parallel applications on the hypercube, using a novel scheme of algorithm-based error detection. System-level error detection mechanisms have been implemented for three parallel applications on a 16-processor Intel iPSC hypercube multiprocessor: matrix multiplication, Gaussian elimination, and fast Fourier transform. Schemes for other applications are under development. The error coverage of the system-level error detection schemes has been studied extensively in the presence of finite-precision arithmetic, which affects the system-level encodings. Two reconfiguration schemes are proposed that isolate faulty processors and replace them with spare processors.
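To make the idea of algorithm-based error detection concrete, the following is a minimal single-process sketch for matrix multiplication. It assumes the well-known row/column checksum encoding; the abstract does not spell out the exact encodings or the hypercube data distribution used, so the function names (`add_column_checksum`, `add_row_checksum`, `find_inconsistencies`) and the tolerance values are illustrative assumptions, not the authors' implementation. The tolerance stands in for the finite-precision effects mentioned above: checksums computed in floating point rarely match exactly, so the comparison has to allow a small roundoff margin.

```python
import math


def matmul(a, b):
    """Plain triple-loop matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def add_column_checksum(a):
    """Append one extra row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def add_row_checksum(b):
    """Append one extra column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]


def find_inconsistencies(c_full, rel_tol=1e-9, abs_tol=1e-9):
    """Return (bad_rows, bad_cols) of a full-checksum product matrix.

    A row or column is flagged when the sum of its data elements differs
    from its checksum element by more than the tolerance. The tolerances
    are assumed values chosen for this sketch.
    """
    n = len(c_full) - 1      # number of data rows
    m = len(c_full[0]) - 1   # number of data columns
    bad_rows = [
        i for i in range(n)
        if not math.isclose(sum(c_full[i][:m]), c_full[i][m],
                            rel_tol=rel_tol, abs_tol=abs_tol)
    ]
    bad_cols = [
        j for j in range(m)
        if not math.isclose(sum(c_full[i][j] for i in range(n)), c_full[n][j],
                            rel_tol=rel_tol, abs_tol=abs_tol)
    ]
    return bad_rows, bad_cols


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]

# The product of a column-checksum matrix and a row-checksum matrix
# carries both row and column checksums.
c_full = matmul(add_column_checksum(a), add_row_checksum(b))
clean = find_inconsistencies(c_full)

# Simulate a faulty processor corrupting one result element.
c_full[1][0] += 1.0
faulty = find_inconsistencies(c_full)
```

In this sketch a single corrupted element shows up as one inconsistent row and one inconsistent column, and their intersection identifies the element. On a multiprocessor, mapping that element back to the processor that computed it depends on how the matrix is partitioned, which is not modeled here.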
