Improving the Computing Efficiency of HPC Systems Using a Combination of Proactive and Preventive Checkpointing

As failure frequency increases with the component count of modern and future supercomputers, resilience is becoming critical for extreme-scale systems. Combining failure prediction with proactive checkpointing seeks to reduce the effect of failures on the execution time of parallel applications. Unfortunately, proactive checkpointing does not systematically avoid restarting from scratch. To mitigate this issue, failure prediction and proactive checkpointing can be coupled with periodic checkpointing. However, blind use of these techniques does not always improve system efficiency, because each of them comes with a mix of overheads and benefits. To study and understand the combination of these techniques and how much it improves system efficiency, we developed: (i) a prototype combining state-of-the-art failure prediction, fast proactive checkpointing and preventive checkpointing; (ii) a mathematical model that reflects the expected computing efficiency of the combination and computes the optimal checkpointing interval in this context; (iii) a discrete event simulator to evaluate the computing efficiency of the combination for system parameters corresponding to current and projected large-scale HPC systems. We evaluate our proposed technique on a large supercomputer (TSUBAME2) with production-level HPC applications and show that failure prediction, proactive checkpointing and preventive checkpointing can be coupled successfully, imposing only about 2% to 6% of overhead in comparison with preventive checkpointing alone. Moreover, our model-based simulations show that the optimal solution improves the computing efficiency by up to 30% in comparison with classic periodic checkpointing. We show that prediction recall has a much higher impact on execution efficiency than prediction precision. This result suggests that researchers working on failure prediction algorithms should focus on improving recall. We also show that the combination of these techniques can significantly improve (by a factor of 2, for a particular configuration) the mean time between failures (MTBF) perceived by the application.
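To make the link between recall, perceived MTBF and the checkpointing interval concrete, here is a minimal Python sketch. It is not the model developed in the paper. It uses Young's classic first-order approximation of the optimal interval, and it assumes that every correctly predicted failure is fully absorbed by a proactive checkpoint, so that only unpredicted failures count against the application. The function names, and the choice to ignore the cost of proactive checkpoints and of false alarms (precision), are simplifying assumptions made for illustration.

```python
import math


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order approximation of the optimal checkpoint interval.

    Both arguments are in the same time unit (e.g. seconds).
    """
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def perceived_mtbf(mtbf, recall):
    """MTBF seen by the application when a fraction `recall` of failures
    is predicted in time.

    Assumption (illustrative): predicted failures are handled by a
    proactive checkpoint at no cost, so only the remaining
    (1 - recall) fraction of failures interrupts the run.
    """
    if not 0.0 <= recall < 1.0:
        raise ValueError("recall must be in [0, 1)")
    return mtbf / (1.0 - recall)


def first_order_waste(checkpoint_cost, interval, mtbf):
    """First-order fraction of time lost: checkpoint overhead plus the
    expected re-execution after a failure (half an interval on average)."""
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


if __name__ == "__main__":
    cost, mtbf = 50.0, 10000.0  # hypothetical values, in seconds
    for recall in (0.0, 0.5):
        seen = perceived_mtbf(mtbf, recall)
        interval = young_interval(cost, seen)
        waste = first_order_waste(cost, interval, seen)
        print(f"recall={recall}: perceived MTBF={seen:.0f}s, "
              f"interval={interval:.0f}s, waste={waste:.3f}")
```

Under these assumptions a recall of 0.5 doubles the perceived MTBF, which is consistent with a factor-of-2 improvement being reachable for some configuration. Precision does not appear in the sketch at all, because false alarms only add extra proactive checkpoints, a cost this simplified version leaves out.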
