Improving the Computing Efficiency of HPC Systems Using a Combination of Proactive and Preventive Checkpointing
Franck Cappello | Ana Gainaru | Naoya Maruyama | Satoshi Matsuoka | Leonardo Arturo Bautista-Gomez | Mohamed-Slim Bouguerra
[1] J. Monaghan,et al. Fundamental differences between SPH and grid methods , 2006, astro-ph/0610051.
[2] Miroslaw Malek,et al. A survey of online failure prediction methods , 2010, CSUR.
[3] Satoshi Matsuoka. Making TSUBAME2.0, the world's greenest production supercomputer, even greener — Challenges to the architects , 2011, IEEE/ACM International Symposium on Low Power Electronics and Design.
[4] Franck Cappello,et al. Distributed Diskless Checkpoint for Large Scale Systems , 2010, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing.
[5] John T. Daly,et al. A higher order estimate of the optimum checkpoint interval for restart dumps , 2006, Future Gener. Comput. Syst..
[6] Jianfeng Zhan,et al. LogMaster: Mining Event Correlations in Logs of Large-Scale Cluster Systems , 2010, 2012 IEEE 31st Symposium on Reliable Distributed Systems.
[7] Yves Robert,et al. Impact of fault prediction on checkpointing strategies , 2012, ArXiv.
[8] Zhiling Lan,et al. Dynamic Meta-Learning for Failure Prediction in Large-Scale Systems: A Case Study , 2008, 2008 37th International Conference on Parallel Processing.
[9] Franck Cappello,et al. Modeling and tolerating heterogeneous failures in large parallel systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[10] Franck Cappello,et al. HydEE: Failure Containment without Event Logging for Large Scale Send-Deterministic MPI Applications , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.
[11] Brian Gough,et al. GNU Scientific Library Reference Manual - Third Edition , 2003 .
[12] Nithin Nakka,et al. Predicting Node Failure in High Performance Computing Systems from Failure and Usage Logs , 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.
[13] V. Springel. The Cosmological simulation code GADGET-2 , 2005, astro-ph/0505010.
[14] Xiaola Lin,et al. A Variational Calculus Approach to Optimal Checkpoint Placement , 2001, IEEE Trans. Computers.
[15] Doohee Nam,et al. Accident prediction model for railway-highway interfaces. , 2006, Accident; analysis and prevention.
[16] Tongdan Jin,et al. Weibull and Gamma Renewal Approximation Using Generalized Exponential Functions , 2008, Commun. Stat. Simul. Comput..
[17] Enrico Zio,et al. A data-driven approach for predicting failure scenarios in nuclear systems , 2010 .
[18] John W. Young. A first order approximation to the optimum checkpoint interval , 1974 .
[19] Franck Cappello,et al. Fault prediction under the microscope: A closer look into HPC systems , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[20] Bronis R. de Supinski,et al. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[21] John Shalf,et al. The International Exascale Software Project roadmap , 2011, Int. J. High Perform. Comput. Appl..
[22] Franck Cappello,et al. Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.
[23] Zhiling Lan,et al. A practical failure prediction with location and lead time for Blue Gene/P , 2010, 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W).
[24] Franck Cappello,et al. Low-overhead diskless checkpoint for hybrid computing systems , 2010, 2010 International Conference on High Performance Computing.
[25] Vivek Sarkar,et al. Software challenges in extreme scale systems , 2009 .
[26] Zhiling Lan,et al. Exploit failure prediction for adaptive fault-tolerance in cluster computing , 2006, Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06).
[27] Franck Cappello,et al. On the Use of Cluster-Based Partial Message Logging to Improve Fault Tolerance for MPI HPC Applications , 2011, Euro-Par.
[28] Bronis R. de Supinski,et al. Detailed Modeling, Design, and Evaluation of a Scalable Multi-level Checkpointing System , 2010 .
[29] Laxmikant V. Kalé,et al. FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI , 2004, 2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935).
[30] Franck Cappello,et al. FTI: High performance Fault Tolerance Interface for hybrid systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[31] R. C. Hunter,et al. Engine Failure Prediction Techniques , 1975 .
[32] Franck Cappello,et al. Adaptive event prediction strategy with dynamic time window for large-scale HPC systems , 2011, SLAML '11.
[33] John W. Young,et al. A first order approximation to the optimum checkpoint interval , 1974, CACM.
[34] Seetharami R. Seelam,et al. Modeling the Impact of Checkpoints on Next-Generation Systems , 2007, 24th IEEE Conference on Mass Storage Systems and Technologies (MSST 2007).
[35] Daniel J. Price. Modelling discontinuities and Kelvin-Helmholtz instabilities in SPH , 2007, J. Comput. Phys..
[36] Franck Cappello,et al. Event Log Mining Tool for Large Scale HPC Systems , 2011, Euro-Par.
[37] Franck Cappello,et al. Checkpointing vs. Migration for Post-Petascale Supercomputers , 2010, 2010 39th International Conference on Parallel Processing.