Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System
Bronis R. de Supinski | Greg Bronevetsky | Adam Moody | Kathryn Mohror
[1] Tyce T. McLarty,et al. Parallel file system testing for the lunatic fringe: the care and feeding of restless I/O power users , 2005, 22nd IEEE / 13th NASA Goddard Conference on Mass Storage Systems and Technologies (MSST'05).
[2] Bhawani Sankar Panda,et al. Performance Evaluation of a Two Level Error Recovery Scheme for Distributed Systems , 2002, IWDC.
[3] Kai Li,et al. Diskless Checkpointing , 1998, IEEE Trans. Parallel Distributed Syst..
[4] John T. Daly,et al. A higher order estimate of the optimum checkpoint interval for restart dumps , 2006, Future Gener. Comput. Syst..
[5] John A. Gunnels,et al. Extending stability beyond CPU millennium: a micron-scale atomistic simulation of Kelvin-Helmholtz instability , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).
[6] A. B. Langdon,et al. On the dominant and subdominant behavior of stimulated Raman and Brillouin scattering driven by nonuniform laser beams , 1998 .
[7] Kai Li,et al. ickp: a consistent checkpointer for multicomputers , 1994, IEEE Parallel & Distributed Technology: Systems & Applications.
[8] Kai Li,et al. Faster checkpointing with N+1 parity , 1994, Proceedings of IEEE 24th International Symposium on Fault-Tolerant Computing.
[9] Stephen L. Scott,et al. Reliability-Aware Approach: An Incremental Checkpoint/Restart Model in HPC Environments , 2008, 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID).
[10] Bianca Schroeder,et al. Understanding failures in petascale computers , 2007 .
[11] Robert E. Lyons,et al. The Use of Triple-Modular Redundancy to Improve Computer Reliability , 1962, IBM J. Res. Dev..
[12] Erol Gelenbe,et al. A model of roll-back recovery with multiple checkpoints , 1976, ICSE '76.
[13] John W. Young,et al. A first order approximation to the optimum checkpoint interval , 1974, CACM.
[14] George Bosilca,et al. Fault tolerant high performance computing by a coding approach , 2005, PPoPP.
[15] Bianca Schroeder,et al. A Large-Scale Study of Failures in High-Performance Computing Systems , 2006, IEEE Transactions on Dependable and Secure Computing.
[16] Stuart I. Feldman,et al. IGOR: a system for program debugging via reversible execution , 1988, PADD '88.
[17] Larry Rudolph,et al. Cooperative checkpointing: a robust approach to large-scale systems reliability , 2006, ICS '06.
[18] N. Hengartner,et al. Predicting the number of fatal soft errors in Los Alamos national laboratory's ASC Q supercomputer , 2005, IEEE Transactions on Device and Materials Reliability.
[19] James S. Plank,et al. Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems , 2001, J. Parallel Distributed Comput..
[20] Nitin H. Vaidya,et al. A case for two-level distributed recovery schemes , 1995, SIGMETRICS '95/PERFORMANCE '95.
[21] Kamil Iskra,et al. ZOID: I/O-forwarding infrastructure for petascale architectures , 2008, PPoPP.
[22] Randy H. Katz,et al. A case for redundant arrays of inexpensive disks (RAID) , 1988, SIGMOD '88.
[23] Robert B. Ross,et al. Providing Efficient I/O Redundancy in MPI Environments , 2004, PVM/MPI.
[24] Andrzej Duda,et al. The Effects of Checkpointing on Program Execution Time , 1983, Inf. Process. Lett..
[25] B R de Supinski,et al. Detailed Modeling, Design, and Evaluation of a Scalable Multi-level Checkpointing System , 2010 .
[26] Luís Moura Silva,et al. Using two-level stable storage for efficient checkpointing , 1998, IEE Proc. Softw..
[27] P. Nowoczynski,et al. Zest Checkpoint storage system for large supercomputers , 2008, 2008 3rd Petascale Data Storage Workshop.
[28] Nitin H. Vaidya. A Case of Multi-Level Distributed Recovery Schemes , 2001 .