Toward an Optimal Online Checkpoint Solution under a Two-Level HPC Checkpoint Model

The traditional single-level checkpointing method suffers from significant overhead on large-scale platforms. Hence, multilevel checkpointing protocols have been studied extensively in recent years. The multilevel checkpoint approach allows different levels of checkpoints to be set (each with different checkpoint overheads and recovery abilities), in order to further improve the fault tolerance performance of extreme-scale HPC applications. How to optimize the checkpoint intervals for each level, however, is an extremely difficult problem. In this paper, we construct an easy-to-use two-level checkpoint model. Checkpoint level 1 deals with errors with low checkpoint/recovery overheads such as transient memory errors, while checkpoint level 2 deals with hardware crashes such as node failures. Compared with previous optimization work, our new optimal checkpoint solution offers two improvements: (1) it is an online solution without requiring knowledge of the job length in advance, and (2) it shows that periodic patterns are optimal and determines the best pattern. We evaluate the proposed solution and compare it with the most up-to-date related approaches on an extreme-scale simulation testbed constructed based on a real HPC application execution. Simulation results show that our proposed solution outperforms other optimized solutions and can improve the performance significantly in some cases. Specifically, with the new solution the wall-clock time can be reduced by up to 25.3 percent over that of other state-of-the-art approaches. Finally, a brute-force comparison with all possible patterns shows that our solution is always within 1 percent of the best pattern in the experiments.

[1]  Henri Casanova,et al.  Checkpointing strategies for parallel jobs , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[2]  Laxmikant V. Kalé,et al.  A scalable double in-memory checkpoint and restart scheme towards exascale , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012).

[3]  Liaojun Pang,et al.  Two-Level Incremental Checkpoint Recovery Scheme for Reducing System Total Overheads , 2014, PloS one.

[4]  Franck Cappello,et al.  Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications , 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.

[5]  John W. Young,et al.  A first order approximation to the optimum checkpoint interval , 1974, CACM.

[6]  Michael Lang,et al.  The design and implementation of a multi-level content-addressable checkpoint file system , 2012, 2012 19th International Conference on High Performance Computing.

[7]  Rolf Riesen,et al.  Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing , 2012, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.

[8]  Franck Cappello,et al.  FTI: High performance Fault Tolerance Interface for hybrid systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[9]  Henri Casanova,et al.  On the impact of process replication on executions of large-scale parallel applications with coordinated checkpointing , 2015, Future Gener. Comput. Syst..

[10]  Austin R. Benson,et al.  Silent error detection in numerical time-stepping schemes , 2015, Int. J. High Perform. Comput. Appl..

[11]  Bronis R. de Supinski,et al.  Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[12]  Mariana Vertenstein,et al.  The Parallel Ocean Program (POP) reference manual: Ocean component of the Community Climate System Model (CCSM) , 2010 .

[13]  Franck Cappello,et al.  Distributed Diskless Checkpoint for Large Scale Systems , 2010, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing.

[14]  Jaeyoung Choi,et al.  Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines , 1994, Sci. Program..

[15]  Nitin H. Vaidya,et al.  A case for two-level distributed recovery schemes , 1995, SIGMETRICS '95/PERFORMANCE '95.

[16]  Franck Cappello,et al.  Fast Error-Bounded Lossy HPC Data Compression with SZ , 2016, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[17]  Franck Cappello,et al.  Optimization of a Multilevel Checkpoint Model with Uncertain Execution Scales , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.

[18]  Zizhong Chen,et al.  Multilevel Diskless Checkpointing , 2013, IEEE Transactions on Computers.

[19]  John Paul Walters,et al.  Replication-Based Fault Tolerance for MPI Applications , 2009, IEEE Transactions on Parallel and Distributed Systems.

[20]  B R de Supinski,et al.  Detailed Modeling, Design, and Evaluation of a Scalable Multi-level Checkpointing System , 2010 .

[21]  John Paul Walters,et al.  A Scalable Asynchronous Replication-Based Strategy for Fault Tolerant MPI Applications , 2007, HiPC.

[22]  John T. Daly,et al.  A higher order estimate of the optimum checkpoint interval for restart dumps , 2006, Future Gener. Comput. Syst..

[23]  Sheri Mickelson,et al.  Community Earth System Model (CESM) , 2011, Encyclopedia of Parallel Computing.

[24]  Franck Cappello,et al.  Low-overhead diskless checkpoint for hybrid computing systems , 2010, 2010 International Conference on High Performance Computing.

[25]  Shangping Ren,et al.  Adaptive optimal checkpoint interval and its impact on system's overall quality in soft real-time applications , 2009, SAC '09.

[26]  Xian-He Sun,et al.  Optimizing HPC Fault-Tolerant Environment: An Analytical Approach , 2010, 2010 39th International Conference on Parallel Processing.

[27]  Stephen L. Scott,et al.  An optimal checkpoint/restart model for a large scale high performance computing system , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.