Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods
[1] Christian Engelmann, et al. A Tunable, Software-Based DRAM Error Detection and Correction Library for HPC, 2011, Euro-Par Workshops.
[2] Kurt B. Ferreira, et al. Fault-tolerant linear solvers via selective reliability, 2012, ArXiv.
[3] Jacob A. Abraham, et al. Algorithm-Based Fault Tolerance for Matrix Operations, 1984, IEEE Transactions on Computers.
[4] Christian Engelmann, et al. Poster: a tunable, software-based DRAM error detection and correction library for HPC, 2011, SC '11 Companion.
[5] Zizhong Chen, et al. Numerically Stable Real Number Codes Based on Random Matrices, 2005, International Conference on Computational Science.
[6] Daniel Marques, et al. Automated application-level checkpointing of MPI programs, 2003, PPoPP '03.
[7] Zizhong Chen, et al. Optimal real number codes for fault tolerant matrix operations, 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.
[8] John T. Daly, et al. A higher order estimate of the optimum checkpoint interval for restart dumps, 2006, Future Gener. Comput. Syst.
[9] Ron Brightwell, et al. Cooperative Application/OS DRAM Fault Recovery, 2011, Euro-Par Workshops.
[10] Yousef Saad, et al. Iterative methods for sparse linear systems, 2003.
[11] Zizhong Chen, et al. Algorithmic Cholesky factorization fault recovery, 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).
[12] Dhabaleswar K. Panda, et al. CIFTS: A Coordinated Infrastructure for Fault-Tolerant Systems, 2009, 2009 International Conference on Parallel Processing.
[13] Franck Cappello, et al. Toward Exascale Resilience, 2009, Int. J. High Perform. Comput. Appl.
[14] Bronis R. de Supinski, et al. Soft error vulnerability of iterative linear algebra methods, 2007, ICS '08.
[15] Hui Liu, et al. High performance linpack benchmark: a fault tolerant implementation without checkpointing, 2011, ICS '11.
[16] Zizhong Chen, et al. Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing, 2009, IEEE Transactions on Computers.
[17] Zizhong Chen, et al. Algorithm-Based Fault Tolerance for Fail-Stop Failures, 2008, IEEE Transactions on Parallel and Distributed Systems.
[18] Mahmut T. Kandemir, et al. Analyzing the soft error resilience of linear solvers on multicore multiprocessors, 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).
[19] Chao Wang, et al. A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance, 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.
[20] Zizhong Chen, et al. Algorithm-based checkpoint-free fault tolerance for parallel matrix computations on volatile resources, 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.
[21] Zizhong Chen. Algorithm-based recovery for iterative methods without checkpointing, 2011, HPDC '11.
[22] Zizhong Chen, et al. Condition Numbers of Gaussian Random Matrices, 2005, SIAM J. Matrix Anal. Appl.
[23] George Bosilca, et al. Fault tolerant high performance computing by a coding approach, 2005, PPoPP.
[24] Greg Burns, et al. LAM: An Open Cluster Environment for MPI, 2002.
[25] Kai Li, et al. Diskless Checkpointing, 1998, IEEE Trans. Parallel Distributed Syst.
[26] Vaidy S. Sunderam, et al. PVM: A Framework for Parallel Distributed Computing, 1990, Concurr. Pract. Exp.
[27] Padma Raghavan, et al. Characterizing the impact of soft errors on iterative methods in scientific computing, 2011, ICS '11.