Fault-Tolerant Matrix Operations On Multiple Processor Systems Using Weighted Checksums

Hardware for performing matrix operations at high speed is in great demand in signal and image processing and in many real-time and scientific applications. VLSI technology has made it possible to perform fast, large-scale vector and matrix computations using multiple copies of low-cost processors. Since any functional error in a high-performance system may seriously jeopardize the operation of the system and its data integrity, some level of fault tolerance must be provided to ensure that the results of long computations are valid. A low-cost checksum scheme has been proposed to obtain fault-tolerant matrix operations on multiple processor systems. However, that scheme can only correct errors in matrix multiplication; it can detect, but not correct, errors in matrix-vector multiplication, LU decomposition, and matrix inversion. To overcome these limitations, this paper proposes a very general matrix encoding scheme that achieves fault-tolerant matrix operations on multiple processor systems. Since many signal and image processing algorithms involving a "multiply-and-accumulate" type of expression can be transformed into matrix-vector multiplications and executed on a linear array, the scheme is extremely useful for cost-effective, fault-tolerant signal and image processing.
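To make the weighted-checksum idea concrete, the following Python sketch illustrates one possible encoding for matrix-vector multiplication. It is an illustrative assumption, not the paper's exact construction: the matrix is augmented with an unweighted checksum row and a weighted checksum row using weights 1, 2, 4, ..., so that the ratio of the two syndromes locates a single erroneous element of the product, which can then be corrected.

```python
import numpy as np

# Minimal sketch of a weighted-checksum encoding for y = A x.
# Assumption: weights 2**i are used so that the syndrome ratio d2/d1
# identifies the index of a single corrupted product element.

def encode_weighted_checksum(A, weights):
    """Append an unweighted column-sum row and a weighted column-sum row to A."""
    return np.vstack([A, A.sum(axis=0), weights @ A])

def correct_product(y_full, weights, tol=1e-8):
    """Given y_full = A_wc @ x (last two entries are checksums of y),
    detect and correct a single erroneous element of y."""
    y, s1, s2 = y_full[:-2].copy(), y_full[-2], y_full[-1]
    d1 = s1 - y.sum()          # unweighted syndrome
    d2 = s2 - weights @ y      # weighted syndrome
    if abs(d1) > tol:          # some y[i] is off by exactly d1
        idx = int(round(np.log2(d2 / d1)))  # 2**idx = d2/d1 locates the error
        y[idx] += d1           # restore the corrupted element
    return y

n = 4
weights = 2.0 ** np.arange(n)          # 1, 2, 4, 8
A, x = np.random.rand(n, n), np.random.rand(n)
A_wc = encode_weighted_checksum(A, weights)

y_full = A_wc @ x                      # normally computed on the processor array
y_full[2] += 5.0                       # emulate one faulty processor's output
y = correct_product(y_full, weights)
assert np.allclose(y, A @ x)           # the corrupted entry is recovered
```

In a processor-array setting each product element, including the two checksum elements, would be computed by a different processor, so a single faulty processor corrupts at most one entry of the augmented result; the two syndromes then suffice to locate and correct it. Handling faults in the checksum elements themselves, and the numerical-tolerance issues of floating-point checksums, are beyond this sketch.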
