Cooperative Application/OS DRAM Fault Recovery

Exascale systems will present considerable fault-tolerance challenges to applications and system software. These systems are expected to suffer several hard and soft errors per day. Unfortunately, many fault-tolerance methods in use, such as rollback recovery, are unsuitable for many expected errors, for example DRAM failures. As a result, applications will need to address these resilience challenges to more effectively utilize future systems. In this paper, we describe work on a cross-layer application / OS framework to handle uncorrected memory errors. We illustrate the use of this framework through its integration with a new fault-tolerant iterative solver within the Trilinos library, and present initial convergence results.

[1]  Dave Dopson SoftECC : A System for Software Memory Integrity Checking , 2005 .

[2]  Valeria Simoncini,et al.  Theory of Inexact Krylov Subspace Methods and Applications to Scientific Computing , 2003, SIAM J. Sci. Comput..

[3]  L. Alvisi,et al.  A Survey of Rollback-Recovery Protocols , 2002 .

[4]  Y. Saad,et al.  GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems , 1986 .

[5]  Jacob A. Abraham,et al.  Algorithm-Based Fault Tolerance for Matrix Operations , 1984, IEEE Transactions on Computers.

[6]  Satoshi Matsuoka,et al.  A high-performance fault-tolerant software framework for memory on commodity GPUs , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[7]  A. Kleen mcelog : memory error handling in user space , 2010 .

[8]  Bronis R. de Supinski,et al.  Soft error vulnerability of iterative linear algebra methods , 2007, ICS '08.

[9]  Xin Li,et al.  A Realistic Evaluation of Memory Hardware Errors and Software System Susceptibility , 2010, USENIX Annual Technical Conference.

[10]  Jack Dongarra,et al.  Recent Advances in the Message Passing Interface - 17th European MPI Users' Group Meeting, EuroMPI 2010, Stuttgart, Germany, September 12-15, 2010. Proceedings , 2010, EuroMPI.

[11]  Rolf Riesen,et al.  libhashckpt: Hash-Based Incremental Checkpointing Using GPU's , 2011, EuroMPI.

[12]  Jack J. Dongarra,et al.  Using Mixed Precision for Sparse Matrix Computations to Enhance the Performance while Achieving 64-bit Accuracy , 2008, TOMS.

[13]  Yousef Saad,et al.  Iterative methods for sparse linear systems , 2003 .

[14]  Zizhong Chen,et al.  Algorithm-based checkpoint-free fault tolerance for parallel matrix computations on volatile resources , 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.

[15]  Gerard L. G. Sleijpen,et al.  Inexact Krylov Subspace Methods for Linear Systems , 2004, SIAM J. Matrix Anal. Appl..

[16]  Tamara G. Kolda,et al.  An overview of the Trilinos project , 2005, TOMS.

[17]  Eduardo Pinheiro,et al.  DRAM errors in the wild: a large-scale field study , 2009, SIGMETRICS '09.

[18]  Yousef Saad,et al.  A Flexible Inner-Outer Preconditioned GMRES Algorithm , 1993, SIAM J. Sci. Comput..

[19]  Kurt B. Ferreira,et al.  Fault-tolerant iterative methods via selective reliability. , 2011 .

[20]  Timothy A. Davis,et al.  The university of Florida sparse matrix collection , 2011, TOMS.