Detailed Modeling and Evaluation of a Scalable Multilevel Checkpointing System
暂无分享,去创建一个
[1] Bhawani Sankar Panda,et al. Performance Evaluation of a Two Level Error Recovery Scheme for Distributed Systems , 2002, IWDC.
[2] Luís Moura Silva,et al. Using two-level stable storge for efficient checkpointing , 1998, IEE Proc. Softw..
[3] Erol Gelenbe,et al. A model of roll-back recovery with multiple checkpoints , 1976, ICSE '76.
[4] Erich Schikuta,et al. Parallel I/O , 2001, Int. J. High Perform. Comput. Appl..
[5] Bronis R. de Supinski,et al. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[6] Thomas Hérault,et al. A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI , 2012, Euro-Par.
[7] Kai Li,et al. Faster checkpointing with N+1 parity , 1994, Proceedings of IEEE 24th International Symposium on Fault- Tolerant Computing.
[8] Nitin H. Vaidya. A Case of Multi-Level Distributed Recovery Schemes , 2001 .
[9] Nitin H. Vaidya,et al. A case for two-level distributed recovery schemes , 1995, SIGMETRICS '95/PERFORMANCE '95.
[10] Franck Cappello,et al. FTI: High performance Fault Tolerance Interface for hybrid systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[11] John W. Young,et al. A first order approximation to the optimum checkpoint interval , 1974, CACM.
[12] James S. Plank,et al. Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems , 2001, J. Parallel Distributed Comput..
[13] A. B. Langdon,et al. On the dominant and subdominant behavior of stimulated Raman and Brillouin scattering driven by nonuniform laser beams , 1998 .
[14] John A. Gunnels,et al. Extending stability beyond CPU millennium: a micron-scale atomistic simulation of Kelvin-Helmholtz instability , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).
[15] Kamil Iskra,et al. ZOID: I/O-forwarding infrastructure for petascale architectures , 2008, PPoPP.
[16] John T. Daly,et al. A higher order estimate of the optimum checkpoint interval for restart dumps , 2006, Future Gener. Comput. Syst..
[17] D. Quinlan,et al. Inter-Agency Workshop on HPC Resilience at Extreme Scale National Security Agency Advanced Computing Systems February 21 – 24 , 2012 Coordinating Representatives John Daly ( DOD ) Bill Harrod ( DOE / SC ) Thuc Hoang ( DOE / NNSA , 2012 .
[18] Bianca Schroeder,et al. A Large-Scale Study of Failures in High-Performance Computing Systems , 2006, IEEE Transactions on Dependable and Secure Computing.
[19] Robert E. Lyons,et al. The Use of Triple-Modular Redundancy to Improve Computer Reliability , 1962, IBM J. Res. Dev..
[20] Andrzej Duda,et al. The Effects of Checkpointing on Program Execution Time , 1983, Inf. Process. Lett..
[21] N. Hengartner,et al. Predicting the number of fatal soft errors in Los Alamos national laboratory's ASC Q supercomputer , 2005, IEEE Transactions on Device and Materials Reliability.