SRC: Soft Error Detection and Recovery for High Performance Linpack

In high-performance computing, the probability of failure grows with system size. Soft errors may corrupt calculations without being detected by any other means. To address this problem, we develop a checksum-based approach that detects and recovers from calculation errors, and we apply it to the LU factorization algorithm used by High Performance Linpack. The approach has low overhead. Unlike existing approaches that require the calculation to be repeated, it repeats only a fraction of the calculation during recovery. The frequency of checking can be adjusted to match the error rate, resulting in a flexible method of fault tolerance.
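The checksum idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a small dense matrix that needs no pivoting, a row-sum checksum column carried through the elimination, and recovery by rolling back to the last verified state. The names `lu_checked`, `check_every`, and `fault` are hypothetical. Because row operations are applied to the checksum column as well, each row of the working matrix should still sum to its checksum entry after every elimination step; a mismatch signals a corrupted entry.

```python
import copy

def verify(aug, tol=1e-9):
    """Return True if every row still sums to its checksum (last column)."""
    n = len(aug)
    return all(abs(sum(row[:n]) - row[n]) <= tol * max(1.0, abs(row[n]))
               for row in aug)

def eliminate_step(aug, lower, k):
    """One step of Gaussian elimination (no pivoting) on column k.
    The update runs through column n, so the checksum is updated too."""
    n = len(aug)
    for i in range(k + 1, n):
        m = aug[i][k] / aug[k][k]
        lower[i][k] = m
        for j in range(k, n + 1):
            aug[i][j] -= m * aug[k][j]
        aug[i][k] = 0.0  # eliminated entry; the multiplier is kept in `lower`

def lu_checked(a, check_every=2, fault=None):
    """LU factorization with periodic checksum verification.
    fault = (step, row, col, delta) injects one transient error (assumption
    for demonstration only). Returns (L, U, number_of_recoveries)."""
    n = len(a)
    # Append the row-sum checksum column to A.
    aug = [[float(x) for x in row] + [float(sum(row))] for row in a]
    lower = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    saved = (0, copy.deepcopy(aug), copy.deepcopy(lower))  # last verified state
    recoveries = 0
    k = 0
    while k < n - 1:
        eliminate_step(aug, lower, k)
        if fault is not None and fault[0] == k:
            _, i, j, delta = fault
            aug[i][j] += delta   # simulate a soft error in the working matrix
            fault = None         # transient: it does not recur on redo
        k += 1
        if k % check_every == 0 or k == n - 1:
            if verify(aug):
                saved = (k, copy.deepcopy(aug), copy.deepcopy(lower))
            else:
                # Roll back and redo only the steps since the last good check.
                k = saved[0]
                aug = copy.deepcopy(saved[1])
                lower = copy.deepcopy(saved[2])
                recoveries += 1
    upper = [row[:n] for row in aug]
    return lower, upper, recoveries

a = [[4, 1, 2, 1], [1, 5, 1, 2], [2, 1, 6, 1], [1, 2, 1, 7]]
L, U, recoveries = lu_checked(a, check_every=2, fault=(1, 3, 3, 0.5))
print("recoveries:", recoveries)
```

In this sketch, a smaller `check_every` means more verification work but fewer steps to redo after a detected error, which mirrors the trade-off between checking frequency and error rate described above. The row checksum here covers only the working matrix and not the stored multipliers, and a production code would also need to handle pivoting and distributed data.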