Checkpointing schemes for Grid workflow systems

One of the major challenges to the wide use of Grid workflow systems is fault tolerance and fault avoidance. Checkpointing schemes provide a means of fault detection and recovery. In our research, we focus on optimizing the performance of checkpointing schemes and dynamic voltage scaling (DVS) for Grid workflow systems. We propose offline checkpointing schemes with DVS and online adaptive checkpointing schemes that dynamically adjust the checkpointing intervals by using store checkpoints and compare checkpoints. When combined with DVS, the offline adaptive checkpointing schemes are not only fault tolerant but also reduce the average execution time of tasks. These schemes use comparison and storage operations efficiently and significantly improve performance. Further, they can calculate the optimal number of checkpoints that minimizes the mean execution time. We also extend the online adaptive checkpointing schemes from single‐task execution scenarios to multi‐task execution scenarios. Simulation results show that these online schemes substantially increase the likelihood of timely task completion when faults occur. Copyright © 2008 John Wiley & Sons, Ltd.
