Adaptive checkpointing in dynamic grids for uncertain job durations

Adaptive checkpointing is a relatively new approach that is particularly suitable for providing fault-tolerance in dynamic and unstable grid environments. The approach allows for periodic modification of checkpointing intervals at run-time, when additional information becomes available. In this paper an adaptive algorithm, named MeanFailureCP+, is introduced that deals with checkpointing of grid applications with execution times that are unknown a priori. The algorithm modifies its parameters, based on dynamically collected feedback on its performance. Simulation results show that the new algorithm performs even better than adaptive approaches that make use of exact information on job execution times.

[1]  Filip De Turck,et al.  Adaptive Task Checkpointing and Replication: Toward Efficient Fault-Tolerant Grids , 2009, IEEE Transactions on Parallel and Distributed Systems.

[2]  Zhiling Lan,et al.  Adaptive Fault Management of Parallel Applications for High-Performance Computing , 2008, IEEE Transactions on Computers.

[3]  Ramendra K. Sahoo,et al.  Evaluating cooperative checkpointing for supercomputing systems , 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.

[4]  Larry Rudolph,et al.  Cooperative checkpointing: a robust approach to large-scale systems reliability , 2006, ICS '06.

[5]  Lefteris Angelis,et al.  Performance and effectiveness trade‐off for checkpointing in fault‐tolerant distributed systems , 2007, Concurr. Comput. Pract. Exp..

[6]  Jehoshua Bruck,et al.  An on-line algorithm for checkpoint placement , 1996, Proceedings of ISSRE '96: 7th International Symposium on Software Reliability Engineering.

[7]  Hong Chen,et al.  Optimizing Adaptive Checkpointing Schemes for Grid Workflow Systems , 2006, 2006 Fifth International Conference on Grid and Cooperative Computing Workshops.