Impact of fault prediction on checkpointing strategies

This paper deals with the impact of fault prediction techniques on checkpointing strategies. We extend the classical analysis of Young and Daly in the presence of a fault prediction system, which is characterized by its recall and its precision, and which provides either exact or window-based time predictions. We succeed in deriving the optimal value of the checkpointing period (thereby minimizing the waste of resource usage due to checkpoint overhead) in all scenarios. These results allow to analytically assess the key parameters that impact the performance of fault predictors at very large scale. In addition, the results of this analytical evaluation are nicely corroborated by a comprehensive set of simulations, thereby demonstrating the validity of the model and the accuracy of the results.

[1]  Laxmikant V. Kalé,et al.  A scalable double in-memory checkpoint and restart scheme towards exascale , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012).

[2]  Cheng-Zhong Xu,et al.  Exploring event correlation for failure prediction in coalitions of clusters , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).

[3]  Glenn A. Fink,et al.  Predicting Computer System Failures Using Support Vector Machines , 2008, WASL.

[4]  Franck Cappello,et al.  Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.

[5]  Franck Cappello,et al.  Preventive Migration vs. Preventive Checkpointing for Extreme Scale Supercomputers , 2011, Parallel Process. Lett..

[6]  Zhiling Lan,et al.  A practical failure prediction with location and lead time for Blue Gene/P , 2010, 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W).

[7]  Zhiling Lan,et al.  Practical online failure prediction for Blue Gene/P: Period-based vs event-driven , 2011, 2011 IEEE/IFIP 41st International Conference on Dependable Systems and Networks Workshops (DSN-W).

[8]  Bianca Schroeder,et al.  A Large-Scale Study of Failures in High-Performance Computing Systems , 2010, IEEE Trans. Dependable Secur. Comput..

[9]  James H. Laros,et al.  Evaluating the viability of process replication reliability for exascale systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[10]  Hui Xiong,et al.  Failure Prediction in IBM BlueGene/L Event Logs , 2007, ICDM.

[11]  John T. Daly,et al.  A higher order estimate of the optimum checkpoint interval for restart dumps , 2006, Future Gener. Comput. Syst..

[12]  John W. Young,et al.  A first order approximation to the optimum checkpoint interval , 1974, CACM.

[13]  Franck Cappello,et al.  Modeling and tolerating heterogeneous failures in large parallel systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[14]  Richard P. Martin,et al.  Improving cluster availability using workstation validation , 2002, SIGMETRICS '02.