Toward an Optimal Online Checkpoint Solution under a Two-Level HPC Checkpoint Model
Franck Cappello | Yves Robert | Frédéric Vivien | Sheng Di
[1] Henri Casanova, et al. Checkpointing strategies for parallel jobs, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[2] Laxmikant V. Kalé, et al. A scalable double in-memory checkpoint and restart scheme towards exascale, 2012, IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012).
[3] Liaojun Pang, et al. Two-Level Incremental Checkpoint Recovery Scheme for Reducing System Total Overheads, 2014, PloS one.
[4] Franck Cappello, et al. Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications, 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.
[5] John W. Young, et al. A first order approximation to the optimum checkpoint interval, 1974, CACM.
[6] Michael Lang, et al. The design and implementation of a multi-level content-addressable checkpoint file system, 2012, 2012 19th International Conference on High Performance Computing.
[7] Rolf Riesen, et al. Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing, 2012, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.
[8] Franck Cappello, et al. FTI: High performance Fault Tolerance Interface for hybrid systems, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[9] Henri Casanova, et al. On the impact of process replication on executions of large-scale parallel applications with coordinated checkpointing, 2015, Future Gener. Comput. Syst.
[10] Austin R. Benson, et al. Silent error detection in numerical time-stepping schemes, 2015, Int. J. High Perform. Comput. Appl.
[11] Bronis R. de Supinski, et al. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[12] Mariana Vertenstein, et al. The Parallel Ocean Program (POP) reference manual: Ocean component of the Community Climate System Model (CCSM), 2010.
[13] Franck Cappello, et al. Distributed Diskless Checkpoint for Large Scale Systems, 2010, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing.
[14] Jaeyoung Choi, et al. Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines, 1994, Sci. Program.
[15] Nitin H. Vaidya, et al. A case for two-level distributed recovery schemes, 1995, SIGMETRICS '95/PERFORMANCE '95.
[16] Franck Cappello, et al. Fast Error-Bounded Lossy HPC Data Compression with SZ, 2016, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[17] Franck Cappello, et al. Optimization of a Multilevel Checkpoint Model with Uncertain Execution Scales, 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.
[18] Zizhong Chen, et al. Multilevel Diskless Checkpointing, 2013, IEEE Transactions on Computers.
[19] John Paul Walters, et al. Replication-Based Fault Tolerance for MPI Applications, 2009, IEEE Transactions on Parallel and Distributed Systems.
[20] B R de Supinski, et al. Detailed Modeling, Design, and Evaluation of a Scalable Multi-level Checkpointing System, 2010.
[21] John Paul Walters, et al. A Scalable Asynchronous Replication-Based Strategy for Fault Tolerant MPI Applications, 2007, HiPC.
[22] John T. Daly, et al. A higher order estimate of the optimum checkpoint interval for restart dumps, 2006, Future Gener. Comput. Syst.
[23] Sheri Mickelson, et al. Community Earth System Model (CESM), 2011, Encyclopedia of Parallel Computing.
[24] Franck Cappello, et al. Low-overhead diskless checkpoint for hybrid computing systems, 2010, 2010 International Conference on High Performance Computing.
[25] Shangping Ren, et al. Adaptive optimal checkpoint interval and its impact on system's overall quality in soft real-time applications, 2009, SAC '09.
[26] Xian-He Sun, et al. Optimizing HPC Fault-Tolerant Environment: An Analytical Approach, 2010, 2010 39th International Conference on Parallel Processing.
[27] Stephen L. Scott, et al. An optimal checkpoint/restart model for a large scale high performance computing system, 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.