Checkpointing Strategies with Prediction Windows

This paper studies the impact of fault prediction techniques on checkpointing strategies. We consider fault-prediction systems that do not provide exact prediction dates, but instead provide time intervals during which faults are predicted to strike. These intervals considerably complicate the analysis of checkpointing strategies. We propose a new approach based on two periodic modes: a regular mode outside prediction windows, and a proactive mode inside prediction windows, which is used whenever the window is large enough. We are able to compute the best period for any size of the prediction windows, thereby deriving the scheduling strategy that minimizes platform waste. In addition, the results of the analytical study are corroborated by a comprehensive set of simulations, which demonstrate the validity of the model and the accuracy of the approach.
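The two-mode idea can be illustrated with a minimal sketch. It is not the paper's algorithm: the function names, the `min_window` threshold, and the representation of windows as `(start, end)` pairs are all hypothetical. The regular-mode period uses Young's classical first-order approximation, sqrt(2 * MTBF * C), as a stand-in for the optimal period derived in the paper.

```python
import math


def young_period(mtbf: float, checkpoint_cost: float) -> float:
    """Young's first-order approximation of the checkpoint period.

    Used here only as a stand-in for the regular-mode period; the
    paper derives its own optimal periods, which are not reproduced here.
    """
    return math.sqrt(2.0 * mtbf * checkpoint_cost)


def choose_mode(t: float, windows: list[tuple[float, float]],
                min_window: float) -> str:
    """Return which periodic mode applies at time t (hypothetical helper).

    windows    -- predicted intervals (start, end) in which a fault may strike
    min_window -- assumed threshold below which a window is considered too
                  short to justify switching to proactive checkpointing
    """
    for start, end in windows:
        inside = start <= t < end
        large_enough = (end - start) >= min_window
        if inside and large_enough:
            return "proactive"
    # Outside every window, or inside one that is too short.
    return "regular"


if __name__ == "__main__":
    # Illustrative values only: MTBF of 2 hours, 100-second checkpoints.
    print(young_period(mtbf=7200.0, checkpoint_cost=100.0))
    print(choose_mode(50.0, [(40.0, 100.0)], min_window=30.0))
```

In this sketch the proactive mode would checkpoint with its own, shorter period inside a sufficiently large window; how that period and the threshold are chosen is exactly what the paper's analysis addresses.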
