Cooperative Application/OS DRAM Fault Recovery
Ron Brightwell | Kurt B. Ferreira | Patrick G. Bridges | Michael A. Heroux | Mark Hoemmen | Philip Soltero
[1] Dave Dopson. SoftECC: A System for Software Memory Integrity Checking, 2005.
[2] Valeria Simoncini, et al. Theory of Inexact Krylov Subspace Methods and Applications to Scientific Computing, 2003, SIAM J. Sci. Comput.
[3] L. Alvisi, et al. A Survey of Rollback-Recovery Protocols, 2002.
[4] Y. Saad, et al. GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems, 1986.
[5] Jacob A. Abraham, et al. Algorithm-Based Fault Tolerance for Matrix Operations, 1984, IEEE Transactions on Computers.
[6] Satoshi Matsuoka, et al. A high-performance fault-tolerant software framework for memory on commodity GPUs, 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).
[7] A. Kleen. mcelog: memory error handling in user space, 2010.
[8] Bronis R. de Supinski, et al. Soft error vulnerability of iterative linear algebra methods, 2007, ICS '08.
[9] Xin Li, et al. A Realistic Evaluation of Memory Hardware Errors and Software System Susceptibility, 2010, USENIX Annual Technical Conference.
[10] Jack Dongarra, et al. Recent Advances in the Message Passing Interface - 17th European MPI Users' Group Meeting, EuroMPI 2010, Stuttgart, Germany, September 12-15, 2010. Proceedings, 2010, EuroMPI.
[11] Rolf Riesen, et al. libhashckpt: Hash-Based Incremental Checkpointing Using GPU's, 2011, EuroMPI.
[12] Jack J. Dongarra, et al. Using Mixed Precision for Sparse Matrix Computations to Enhance the Performance while Achieving 64-bit Accuracy, 2008, TOMS.
[13] Yousef Saad, et al. Iterative methods for sparse linear systems, 2003.
[14] Zizhong Chen, et al. Algorithm-based checkpoint-free fault tolerance for parallel matrix computations on volatile resources, 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.
[15] Gerard L. G. Sleijpen, et al. Inexact Krylov Subspace Methods for Linear Systems, 2004, SIAM J. Matrix Anal. Appl.
[16] Tamara G. Kolda, et al. An overview of the Trilinos project, 2005, TOMS.
[17] Eduardo Pinheiro, et al. DRAM errors in the wild: a large-scale field study, 2009, SIGMETRICS '09.
[18] Yousef Saad, et al. A Flexible Inner-Outer Preconditioned GMRES Algorithm, 1993, SIAM J. Sci. Comput.
[19] Kurt B. Ferreira, et al. Fault-tolerant iterative methods via selective reliability, 2011.
[20] Timothy A. Davis, et al. The University of Florida sparse matrix collection, 2011, TOMS.