Mantissa-Preserving Operations and Robust Algorithm-Based Fault Tolerance for Matrix Computations

Algorithm-based fault tolerance (ABFT) is a system-level method for achieving fault tolerance that has been proposed by a number of researchers. Many ABFT schemes use a floating-point checksum test to detect computation errors resulting from hardware faults. This makes the tests susceptible to roundoff inaccuracies in floating-point operations, which either cause false alarms or lead to undetected errors. Thresholding the equality test is commonly used to avoid false alarms; however, a good threshold, one that minimizes false alarms without significantly reducing error coverage, is difficult to find, especially when little is known about the input data. Furthermore, thresholded checksums inevitably miss lower-bit errors, which can be magnified as a computation such as LU decomposition progresses. We develop a theory for applying an integer mantissa checksum test to "mantissa-preserving" floating-point computations. This test is not susceptible to roundoff problems and yields 100% error coverage without false alarms. For computations that are not fully mantissa-preserving, we show how to apply the mantissa checksum test to the mantissa-preserving components of the computation and the floating-point test to the rest. We apply this general methodology to matrix-matrix multiplication and to LU decomposition via the Gaussian elimination (GE) algorithm, and find that the accuracy of this new "hybrid" testing scheme is substantially higher than that of the floating-point test with thresholding.
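The trade-off described above can be seen in a minimal sketch of the classical floating-point checksum test for ABFT matrix multiplication. This is an illustrative example, not the paper's exact scheme; the threshold `tau` and the function name `abft_matmul_check` are assumptions introduced here to show why a tolerance is needed at all and why choosing it is delicate.

```python
import numpy as np

def abft_matmul_check(A, B, tau=1e-10):
    """Compute C = A @ B and verify it with a column-checksum test."""
    n = A.shape[0]
    e = np.ones(n)

    # Encode: append a column-checksum row (the column sums of A) to A.
    A_checked = np.vstack([A, e @ A])

    # Multiply the encoded matrix; absent a fault, the last row of
    # C_checked should equal the column sums of C = A @ B.
    C_checked = A_checked @ B
    C, checksum_row = C_checked[:-1, :], C_checked[-1, :]

    # Floating-point equality cannot be tested exactly: the checksum row
    # and the column sums of C accumulate roundoff differently, so the
    # comparison must use a threshold tau.
    discrepancy = np.abs(checksum_row - e @ C)
    return C, bool(np.all(discrepancy <= tau))

A = np.random.rand(4, 4)
B = np.random.rand(4, 4)
C, ok = abft_matmul_check(A, B)
print(ok)  # True in the fault-free case, provided tau absorbs the roundoff
```

If `tau` is set too small, ordinary roundoff raises false alarms; if it is set too large, errors confined to the low-order mantissa bits pass the test, which is the gap the mantissa checksum test is meant to close.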

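For the mantissa-preserving part of a computation, an exact integer test can replace the thresholded floating-point test. The following sketch illustrates the idea under stated assumptions: it extracts each operand's IEEE-754 fraction bits as an integer and checksums them with exact integer arithmetic, so the comparison needs no threshold. The helper names `mantissa_bits` and `mantissa_checksum` are illustrative and not taken from the paper.

```python
import struct

def mantissa_bits(x: float) -> int:
    """Return the 52 stored fraction bits of an IEEE-754 double as an integer."""
    (raw,) = struct.unpack('<Q', struct.pack('<d', x))
    return raw & ((1 << 52) - 1)

def mantissa_checksum(values) -> int:
    """Exact integer sum of the operands' mantissas (Python ints do not overflow)."""
    return sum(mantissa_bits(v) for v in values)

# A mantissa-preserving step (here, simply permuting the data) leaves every
# mantissa bit pattern unchanged, so the checksums must match exactly;
# any mismatch indicates a fault, with no roundoff-induced false alarms.
data = [3.141592653589793, -2.718281828459045, 1.0e-300]
before = mantissa_checksum(data)
moved = list(reversed(data))      # stand-in for a mantissa-preserving operation
after = mantissa_checksum(moved)
assert before == after            # exact test: no threshold required
```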