Checkpointing Strategies with Prediction Windows
[1] Richard P. Martin, et al. Improving cluster availability using workstation validation, 2002, SIGMETRICS '02.
[2] Laxmikant V. Kalé, et al. A scalable double in-memory checkpoint and restart scheme towards exascale, 2012, IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012).
[3] Franck Cappello, et al. Modeling and tolerating heterogeneous failures in large parallel systems, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[4] Anand Sivasubramaniam, et al. Failure Prediction in IBM BlueGene/L Event Logs, 2007, Seventh IEEE International Conference on Data Mining (ICDM 2007).
[5] Franck Cappello, et al. Fault prediction under the microscope: A closer look into HPC systems, 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[6] Jean-Marc Vincent, et al. A Flexible Checkpoint/Restart Model in Distributed Systems, 2009, PPAM.
[7] Glenn A. Fink, et al. Predicting Computer System Failures Using Support Vector Machines, 2008, WASL.
[8] Zhiling Lan, et al. Fault-Aware Runtime Strategies for High-Performance Computing, 2009, IEEE Transactions on Parallel and Distributed Systems.
[9] John W. Young, et al. A first order approximation to the optimum checkpoint interval, 1974, CACM.
[10] Henri Casanova, et al. Checkpointing strategies for parallel jobs, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[11] James H. Laros, et al. Evaluating the viability of process replication reliability for exascale systems, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[12] Franck Cappello, et al. Toward Exascale Resilience, 2009, Int. J. High Perform. Comput. Appl.
[13] Bianca Schroeder, et al. A Large-Scale Study of Failures in High-Performance Computing Systems, 2006, IEEE Transactions on Dependable and Secure Computing.
[14] John T. Daly, et al. A higher order estimate of the optimum checkpoint interval for restart dumps, 2006, Future Gener. Comput. Syst.
[15] Yves Robert, et al. Checkpointing algorithms and fault prediction, 2014, J. Parallel Distributed Comput.
[16] Yennun Huang, et al. Software rejuvenation: analysis, module and applications, 1995, Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers.
[17] Franck Cappello, et al. Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems, 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.
[18] Franck Cappello, et al. Preventive Migration vs. Preventive Checkpointing for Extreme Scale Supercomputers, 2011, Parallel Process. Lett.
[19] Zhiling Lan, et al. A practical failure prediction with location and lead time for Blue Gene/P, 2010, 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W).
[20] Franck Cappello, et al. The International Exascale Software Project: a Call To Cooperative Action By the Global High-Performance Community, 2009, Int. J. High Perform. Comput. Appl.
[21] Zhiling Lan, et al. Practical online failure prediction for Blue Gene/P: Period-based vs event-driven, 2011, 2011 IEEE/IFIP 41st International Conference on Dependable Systems and Networks Workshops (DSN-W).
[22] Heon Young Yeom, et al. On the choice of checkpoint interval using memory usage profile and adaptive time series analysis, 2001, Proceedings 2001 Pacific Rim International Symposium on Dependable Computing.
[23] Kishor S. Trivedi, et al. Proactive management of software aging, 2001, IBM J. Res. Dev.
[24] Franck Cappello, et al. Improving the Computing Efficiency of HPC Systems Using a Combination of Proactive and Preventive Checkpointing, 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.