Numerical Analysis of Fixed Point Algorithms in the Presence of Hardware Faults

The exponential growth of computational power of the extreme scale machines over the past few decades has led to a corresponding decrease in reliability and a sharp increase of the frequency of hardware faults. Our research focuses on the mathematical challenges presented by the silent hardware faults; i.e., faults that can perturb the result of computations in an inconspicuous way. Using the approach of selective reliability, we present an analytic fault mode that can be used to study the resilience properties of a numerical algorithm. We apply our approach to the classical fixed point iteration and demonstrate that in the presence of hardware faults, the classical method fails to converge in expectation. We preset a modified resilient algorithm that detects and rejects faults resulting in error with large magnitude, while small faults are negated by the natural self-correcting properties of the algorithm. We show that our method is convergent (in first and second statistical moments) even in the presenc...

[1]  Jack Dongarra,et al.  Applied Mathematics Research for Exascale Computing , 2014 .

[2]  Prithviraj Banerjee,et al.  Algorithm-Based Error Detection Schemes for Iterative Solution of Partial Differential Equations , 1996, IEEE Trans. Computers.

[3]  Franck Cappello,et al.  Addressing failures in exascale computing , 2014, Int. J. High Perform. Comput. Appl..

[4]  F. Mueller,et al.  Quantifying the Impact of Single Bit Flips on Floating Point Arithmetic , 2013 .

[5]  Richard W. Vuduc,et al.  Self-stabilizing iterative solvers , 2013, ScalA '13.

[6]  Eduardo Pinheiro,et al.  DRAM errors in the wild: a large-scale field study , 2009, SIGMETRICS '09.

[7]  Sudhanva Gurumurthi,et al.  Feng Shui of supercomputer memory positional effects in DRAM and SRAM faults , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[8]  Algirdas Avižienis,et al.  Fault-tolerance and fault-intolerance: Complementary approaches to reliable computing , 1975, Reliable Software.

[9]  Thomas Hérault,et al.  Algorithm-based fault tolerance for dense matrix factorizations , 2012, PPoPP '12.

[10]  Jacob A. Abraham,et al.  Algorithm-Based Fault Tolerance for Matrix Operations , 1984, IEEE Transactions on Computers.

[11]  Martin Schulz,et al.  Fault resilience of the algebraic multi-grid solver , 2012, ICS '12.

[12]  Kurt B. Ferreira,et al.  Fault-tolerant iterative methods via selective reliability. , 2011 .

[13]  Frank Mueller,et al.  Evaluating the Impact of SDC on the GMRES Iterative Solver , 2013, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.

[14]  The International Journal of High Performance Computing Applications— , 1998 .

[15]  Franklin T. Luk,et al.  A Linear Algebraic Model of Algorithm-Based Fault Tolerance , 1988, IEEE Trans. Computers.

[16]  J.A. Abraham,et al.  Fault-tolerant matrix arithmetic and signal processing on highly concurrent computing structures , 1986, Proceedings of the IEEE.

[17]  W. Rudin Principles of mathematical analysis , 1964 .

[18]  C. J. Anfinson,et al.  A linear algebraic model of algorithmic-based fault tolerance , 1988, [1988] Proceedings. International Conference on Systolic Arrays.

[19]  Ron Brightwell,et al.  Cooperative Application/OS DRAM Fault Recovery , 2011, Euro-Par Workshops.

[20]  Rakesh Kumar,et al.  Algorithmic approaches to low overhead fault detection for sparse linear algebra , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).

[21]  Bianca Schroeder,et al.  Cosmic rays don't strike twice: understanding the nature of DRAM errors and the implications for system design , 2012, ASPLOS XVII.

[22]  Vilas Sridharan,et al.  A study of DRAM failures in the field , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[23]  Franck Cappello,et al.  Toward Exascale Resilience , 2009, Int. J. High Perform. Comput. Appl..

[24]  Frank Mueller,et al.  Exploiting Data Representation for Fault Tolerance , 2014, 2014 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems.

[25]  Bronis R. de Supinski,et al.  Soft error vulnerability of iterative linear algebra methods , 2007, ICS '08.