Towards Optimal Multi-Level Checkpointing
[1] Thomas Hérault, et al. Unified model for assessing checkpointing protocols at extreme-scale, 2014, Concurr. Comput. Pract. Exp.
[2] John W. Young, et al. A first order approximation to the optimum checkpoint interval, 1974, CACM.
[3] Franck Cappello, et al. Modeling and tolerating heterogeneous failures in large parallel systems, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[4] Aaas News, et al. Book Reviews, 1893, Buffalo Medical and Surgical Journal.
[5] Kai Li, et al. Diskless Checkpointing, 1998, IEEE Trans. Parallel Distributed Syst.
[6] John T. Daly, et al. A higher order estimate of the optimum checkpoint interval for restart dumps, 2006, Future Gener. Comput. Syst.
[7] Bronis R. de Supinski, et al. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[8] Yves Robert, et al. Fault-Tolerance Techniques for High-Performance Computing, 2015.
[9] Franck Cappello, et al. FTI: High performance Fault Tolerance Interface for hybrid systems, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[10] Luís Moura Silva, et al. Using two-level stable storage for efficient checkpointing, 1998, IEE Proc. Softw.
[11] R. Gallager. Stochastic Processes, 2014.
[12] Stephen P. Boyd, et al. Convex Optimization, 2004, Algorithms and Theory of Computation Handbook.
[13] Zizhong Chen, et al. Multilevel Diskless Checkpointing, 2013, IEEE Transactions on Computers.
[14] Franck Cappello, et al. Analysis of the Tradeoffs Between Energy and Run Time for Multilevel Checkpointing, 2014, PMBS@SC.
[15] James H. Laros, et al. Evaluating the viability of process replication reliability for exascale systems, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[16] Franck Cappello, et al. Toward an Optimal Online Checkpoint Solution under a Two-Level HPC Checkpoint Model, 2017, IEEE Transactions on Parallel and Distributed Systems.
[17] Nitin H. Vaidya, et al. A case for two-level distributed recovery schemes, 1995, SIGMETRICS '95/PERFORMANCE '95.
[18] Yves Robert, et al. Optimal Resilience Patterns to Cope with Fail-Stop and Silent Errors, 2016, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[19] Y. Robert, et al. Fault-Tolerance Techniques for High-Performance Computing, 2015, Computer Communications and Networks.
[20] Franck Cappello, et al. Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications, 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.
[21] B R de Supinski, et al. Detailed Modeling, Design, and Evaluation of a Scalable Multi-level Checkpointing System, 2010.