Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications
Franck Cappello | Sheng Di | Leonardo Arturo Bautista-Gomez | Mohamed-Slim Bouguerra
[1] John A. Gunnels, et al. Extending stability beyond CPU millennium: a micron-scale atomistic simulation of Kelvin-Helmholtz instability, 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).
[2] Franck Cappello, et al. Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems, 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.
[3] Ravishankar K. Iyer, et al. Modeling coordinated checkpointing for large-scale supercomputers, 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).
[4] Franck Cappello, et al. Modeling and tolerating heterogeneous failures in large parallel systems, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[5] Buddy Bland, et al. Titan - Early experience with the Titan system at Oak Ridge National Laboratory, 2012, 2012 SC Companion: High Performance Computing, Networking Storage and Analysis.
[6] Bronis R. de Supinski, et al. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[7] Edgar Dehn, et al. Algebraic Equations: An Introduction to the Theories of Lagrange and Galois, 1934, The Mathematical Gazette.
[8] Van-Anh Truong, et al. Availability in Globally Distributed Storage Systems, 2010, OSDI.
[9] John T. Daly, et al. A higher order estimate of the optimum checkpoint interval for restart dumps, 2006, Future Gener. Comput. Syst.
[10] Franck Cappello, et al. FTI: High performance Fault Tolerance Interface for hybrid systems, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[11] Franck Cappello, et al. Distributed Diskless Checkpoint for Large Scale Systems, 2010, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing.
[12] John W. Young, et al. A first order approximation to the optimum checkpoint interval, 1974, CACM.
[13] Franck Cappello, et al. Toward Exascale Resilience, 2009, Int. J. High Perform. Comput. Appl.
[14] Laxmikant V. Kalé, et al. A scalable double in-memory checkpoint and restart scheme towards exascale, 2012, IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012).
[15] Kurt B. Ferreira, et al. Keeping checkpoint/restart viable for exascale systems, 2011.
[16] John Shalf, et al. The International Exascale Software Project roadmap, 2011, Int. J. High Perform. Comput. Appl.
[17] Yves Robert, et al. Checkpointing algorithms and fault prediction, 2014, J. Parallel Distributed Comput.
[18] Bianca Schroeder, et al. Understanding failures in petascale computers, 2007.
[19] J. Hüsler, et al. Laws of Small Numbers: Extremes and Rare Events, 1994.
[20] Jason Duell, et al. Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters, 2006.
[21] Henri Casanova, et al. Checkpointing strategies for parallel jobs, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[22] Franck Cappello, et al. Improving the Computing Efficiency of HPC Systems Using a Combination of Proactive and Preventive Checkpointing, 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[23] James H. Laros, et al. Evaluating the viability of process replication reliability for exascale systems, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[24] Bronis R. de Supinski, et al. Detailed Modeling and Evaluation of a Scalable Multilevel Checkpointing System, 2014, IEEE Transactions on Parallel and Distributed Systems.
[25] Mariana Vertenstein, et al. The Parallel Ocean Program (POP) reference manual: Ocean component of the Community Climate System Model (CCSM), 2010.
[26] Franck Cappello, et al. Low-overhead diskless checkpoint for hybrid computing systems, 2010, 2010 International Conference on High Performance Computing.