An optimal checkpoint/restart model for a large scale high performance computing system
Stephen L. Scott | Raja Nassar | Chokchai Leangsuksun | Mihaela Paun | Nichamon Naksinehaboon | Yudan Liu
[1] David F. Heidel, et al. An Overview of the BlueGene/L Supercomputer, 2002, ACM/IEEE SC 2002 Conference (SC'02).
[2] Mark A. Franklin, et al. Distributed computing systems and checkpointing, 1993, Proceedings of the 2nd International Symposium on High Performance Distributed Computing.
[3] K. Mani Chandy, et al. Analytic models for rollback and recovery strategies in data base systems, 1975, IEEE Transactions on Software Engineering.
[4] Xiaola Lin, et al. A Variational Calculus Approach to Optimal Checkpoint Placement, 2001, IEEE Trans. Computers.
[5] John W. Young, et al. A first order approximation to the optimum checkpoint interval, 1974, CACM.
[6] John T. Daly, et al. A higher order estimate of the optimum checkpoint interval for restart dumps, 2006, Future Gener. Comput. Syst.
[7] John Daly. A Model for Predicting the Optimum Checkpoint Interval for Restart Dumps, 2003, International Conference on Computational Science.
[8] Larry Rudolph, et al. Cooperative checkpointing: a robust approach to large-scale systems reliability, 2006, ICS '06.
[9] Chokchai Leangsuksun, et al. On the Survivability of Standard MPI Applications, 2006.
[10] James S. Plank, et al. The average availability of parallel checkpointing systems and its importance in selecting runtime parameters, 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing.
[11] E. N. Elnozahy, et al. Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery, 2004, IEEE Transactions on Dependable and Secure Computing.
[12] Jason Duell, et al. The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing, 2005, Int. J. High Perform. Comput. Appl.
[13] Yudan Liu. Reliability-aware optimal checkpoint/restart model in high performance computing, 2007.
[14] Robert Geist, et al. Selection of a checkpoint interval in a critical-task environment, 1988.
[15] Tadashi Dohi, et al. Distribution-free checkpoint placement algorithms based on min-max principle, 2006, IEEE Transactions on Dependable and Secure Computing.
[16] Sheldon M. Ross, et al. Stochastic Processes, 2018, Gauge Integral Structures for Stochastic Calculus and Quantum Electrodynamics.
[17] Victor F. Nicola, et al. Checkpointing and the modeling of program execution time, 1994.
[18] Bianca Schroeder, et al. A Large-Scale Study of Failures in High-Performance Computing Systems, 2006, IEEE Transactions on Dependable and Secure Computing.
[19] K. Mani Chandy, et al. A Survey of Analytic Models of Rollback and Recovery Strategies, 1975, Computer.
[20] Michael Treaster. A Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems, 2005, ArXiv.