Research on Optimum Checkpoint Interval for Hybrid Fault Tolerance

As high-performance computing systems grow rapidly in size and complexity, passive fault tolerance alone can no longer provide system reliability effectively, because of its high overhead and poor scalability. Hybrid fault tolerance, which combines passive and active fault-tolerant approaches, has the potential to be widely used in exascale systems. However, many issues with this method still need to be resolved. This paper focuses on checkpointing in hybrid fault tolerance, where a common question is how to optimize the checkpoint interval. It proposes two models of systems that adopt hybrid fault tolerance and evaluates their effectiveness by comparing their results with simulation. Experimental results show that the modified model predicts both the total work time and the optimum checkpoint interval accurately.
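As background for the interval-optimization question, the sketch below uses the classic first-order approximation, in which the optimum interval is sqrt(2 x checkpoint cost x mean time between failures). It is not either of the two models proposed in this paper; the function names and the numbers are illustrative assumptions.

```python
import math


def overhead_fraction(interval, checkpoint_cost, mtbf):
    """First-order fraction of time lost per unit of useful work.

    checkpoint_cost / interval : time spent writing checkpoints
    interval / (2 * mtbf)      : expected rework after a failure
                                 (on average half an interval is lost)
    Assumes checkpoint_cost << interval << mtbf.
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


def first_order_optimum_interval(checkpoint_cost, mtbf):
    """Interval minimizing overhead_fraction: sqrt(2 * C * M)."""
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


if __name__ == "__main__":
    # Hypothetical values: 50 s per checkpoint, 10,000 s between failures.
    cost, mtbf = 50.0, 10_000.0
    best = first_order_optimum_interval(cost, mtbf)
    for interval in (best / 2, best, best * 2):
        lost = overhead_fraction(interval, cost, mtbf)
        print(f"interval={interval:7.1f} s  overhead={lost:.3f}")
```

Checkpointing more often than the optimum wastes time writing checkpoints, and checkpointing less often wastes time recomputing lost work. In a hybrid scheme, where failure prediction avoids some failures proactively, the effective failure rate seen by the checkpointing layer changes, which is why the interval has to be re-derived.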
