Checkpointing schemes for Grid workflow systems

One of the major challenges to the wide use of Grid workflow systems is fault tolerance and fault avoidance. Checkpointing schemes provide a means of fault detection and recovery. In our research, we focus on the performance optimization of checkpointing schemes and dynamic voltage scaling (DVS) for Grid workflow systems. We propose offline checkpointing schemes with DVS and online adaptive checkpointing schemes that dynamically adjust the checkpointing intervals by using store checkpoints and compare checkpoints. When combined with DVS, offline adaptive checkpointing schemes are not only fault tolerant but also reduce the average execution time of tasks. These schemes make efficient use of comparison and storage operations and significantly improve performance. Further, these schemes can calculate the optimal number of checkpoints, that is, the number that minimizes the mean execution time. We also extend the online adaptive checkpointing schemes from single-task execution scenarios to multi-task execution scenarios. Simulation results show that these online schemes substantially increase the likelihood of timely task completion when faults occur. Copyright © 2008 John Wiley & Sons, Ltd.
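
To illustrate what calculating an optimal number of checkpoints can look like, the following is a minimal sketch under a simplified textbook model, not the schemes proposed in this paper. It assumes faults arrive as a Poisson process with rate `fault_rate`, that a fault rolls the task back to the most recent checkpoint at no extra cost, and that each checkpoint costs a fixed `overhead`. The function names and parameters are hypothetical and chosen only for this example.

```python
import math


def expected_time(n, work, overhead, fault_rate):
    """Expected completion time when the task is split into n equal segments.

    Assumed model (illustrative only): each segment takes work / n plus one
    checkpoint of cost `overhead`; a fault restarts the current segment.
    With Poisson faults, a segment of length s takes (e^(rate * s) - 1) / rate
    on average.
    """
    segment = work / n + overhead
    return n * math.expm1(fault_rate * segment) / fault_rate


def optimal_checkpoints(work, overhead, fault_rate, max_n=1000):
    """Brute-force search for the segment count with the lowest expected time."""
    return min(
        range(1, max_n + 1),
        key=lambda n: expected_time(n, work, overhead, fault_rate),
    )


if __name__ == "__main__":
    # Hypothetical values: 100 time units of work, checkpoint cost 1,
    # one fault per 100 time units on average.
    best = optimal_checkpoints(work=100.0, overhead=1.0, fault_rate=0.01)
    print(best, expected_time(best, 100.0, 1.0, 0.01))
```

Too few checkpoints make each rollback expensive, while too many let the checkpoint overhead dominate, so the expected time has a minimum at an intermediate count. A brute-force search keeps the sketch simple.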
