Towards Optimal Multi-Level Checkpointing

We provide a framework to analyze multi-level checkpointing protocols by formally defining a $k$-level checkpointing pattern. We provide a first-order approximation to the optimal checkpointing period, and show that the corresponding overhead is on the order of $\sum_{\ell=1}^{k}\sqrt{2\lambda_\ell C_\ell}$, where $\lambda_\ell$ is the error rate at level $\ell$ and $C_\ell$ is the checkpointing cost at level $\ell$. This extends the classical Young/Daly formula for single-level checkpointing. Furthermore, we fully characterize the shape of the optimal pattern (number and positions of checkpoints), and we provide a dynamic programming algorithm to determine the optimal subset of levels to be used. Finally, we perform simulations to check the accuracy of the theoretical study and to confirm the optimality of the subset of levels returned by the dynamic programming algorithm. The results corroborate the theoretical study and demonstrate the usefulness of multi-level checkpointing with the optimal subset of levels.
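As a rough illustration of the quantities named above, the following Python sketch evaluates the first-order overhead $\sum_{\ell=1}^{k}\sqrt{2\lambda_\ell C_\ell}$ and the single-level Young/Daly period $\sqrt{2C/\lambda}$. It also includes a small dynamic program over level subsets. The function names are hypothetical, and the subset model is an assumption not stated in the abstract: errors of a skipped level are taken to be handled by the next higher level that is kept, and the top level is always kept. It is a sketch under those assumptions, not the paper's algorithm.

```python
import math


def young_daly_period(rate, cost):
    """Single-level first-order optimal period: sqrt(2 * C / lambda)."""
    return math.sqrt(2.0 * cost / rate)


def first_order_overhead(rates, costs):
    """First-order overhead when all k levels are used: sum of sqrt(2 * lambda_l * C_l)."""
    return sum(math.sqrt(2.0 * lam * c) for lam, c in zip(rates, costs))


def best_level_subset(rates, costs):
    """Pick the subset of levels minimizing the first-order overhead.

    ASSUMPTION (not from the abstract): if a level is skipped, its error
    rate is added to the next higher level that is kept, and the top
    level k is always kept. Levels are numbered 1..k.
    """
    k = len(rates)
    # prefix[j] = lambda_1 + ... + lambda_j
    prefix = [0.0]
    for lam in rates:
        prefix.append(prefix[-1] + lam)

    # best[j] = minimal overhead covering levels 1..j with level j kept
    best = [0.0] + [math.inf] * k
    parent = [0] * (k + 1)
    for j in range(1, k + 1):
        for i in range(j):
            # Level j absorbs the error rates of levels i+1..j.
            merged_rate = prefix[j] - prefix[i]
            cand = best[i] + math.sqrt(2.0 * merged_rate * costs[j - 1])
            if cand < best[j]:
                best[j] = cand
                parent[j] = i

    # Walk back from the top level to recover the kept levels.
    levels = []
    j = k
    while j > 0:
        levels.append(j)
        j = parent[j]
    return best[k], levels[::-1]


if __name__ == "__main__":
    rates = [1e-4, 1e-5]    # hypothetical error rates (per second)
    costs = [10.0, 100.0]   # hypothetical checkpoint costs (seconds)
    print(first_order_overhead(rates, costs))
    print(best_level_subset(rates, costs))
```

Under this assumed model, a cheap level with frequent errors is worth keeping, while a level whose checkpoint cost is close to that of the level above it tends to be dropped.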
