Efficient checkpoint/verification patterns
[1] George Bosilca, et al. Algorithm-based fault tolerance applied to high performance computing, 2009, J. Parallel Distributed Comput.
[2] Jacob A. Abraham, et al. Algorithm-Based Fault Tolerance for Matrix Operations, 1984, IEEE Transactions on Computers.
[3] Robert E. Lyons, et al. The Use of Triple-Modular Redundancy to Improve Computer Reliability, 1962, IBM J. Res. Dev.
[4] Richard W. Vuduc, et al. Self-stabilizing iterative solvers, 2013, ScalA '13.
[5] Padma Raghavan, et al. Fault tolerant preconditioned conjugate gradient for sparse linear system solution, 2012, ICS '12.
[6] James L. Walsh, et al. IBM experiments in soft fails in computer electronics (1978-1994), 1996, IBM J. Res. Dev.
[7] Zizhong Chen, et al. Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods, 2013, PPoPP '13.
[8] Hans P. Muhlfeld, et al. Cosmic ray soft error rates of 16-Mb DRAM memory chips, 1998, IEEE J. Solid State Circuits.
[9] Bronis R. de Supinski, et al. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[10] Bronis R. de Supinski, et al. Soft error vulnerability of iterative linear algebra methods, 2007, ICS '08.
[11] Bianca Schroeder, et al. Cosmic rays don't strike twice: understanding the nature of DRAM errors and the implications for system design, 2012, ASPLOS XVII.
[12] Austin R. Benson, et al. Silent error detection in numerical time-stepping schemes, 2015, Int. J. High Perform. Comput. Appl.
[13] Laxmikant V. Kalé, et al. ACR: Automatic checkpoint/restart for soft and hard error protection, 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[14] Thomas Hérault, et al. On the Combination of Silent Error Detection and Checkpointing, 2013, 2013 IEEE 19th Pacific Rim International Symposium on Dependable Computing.
[15] L. Alvisi, et al. A Survey of Rollback-Recovery Protocols, 2002.
[16] Andrew A. Chien, et al. When is multi-version checkpointing needed?, 2013, FTXS '13.
[17] Huntington W. Curtis, et al. Accelerated testing for cosmic soft-error rate, 1996, IBM J. Res. Dev.
[18] Rolf Riesen, et al. Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing, 2012, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.
[19] Leslie Lamport, et al. Distributed snapshots: determining global states of distributed systems, 1985, TOCS.
[20] John W. Young, et al. A first order approximation to the optimum checkpoint interval, 1974, CACM.
[21] T. J. O'Gorman. The effect of cosmic rays on the soft error rate of a DRAM at ground level, 1994.
[22] John T. Daly, et al. A higher order estimate of the optimum checkpoint interval for restart dumps, 2006, Future Gener. Comput. Syst.
[23] B. R. de Supinski, et al. Detailed Modeling, Design, and Evaluation of a Scalable Multi-level Checkpointing System, 2010.
[24] Kurt B. Ferreira, et al. Fault-tolerant iterative methods via selective reliability, 2011.
[25] Christian Engelmann, et al. Combining Partial Redundancy and Checkpointing for HPC, 2012, 2012 IEEE 32nd International Conference on Distributed Computing Systems.
[26] Franck Cappello, et al. The International Exascale Software Project: a Call To Cooperative Action By the Global High-Performance Community, 2009, Int. J. High Perform. Comput. Appl.