Fault Detection, Prevention and Recovery in Current Grid Workflow Systems

The workflow paradigm has proven highly successful for building Grid applications. Despite this popularity, the systems that execute workflow applications in Grid environments still cannot deliver the quality, robustness and reliability that their users require. To understand the current state of the art and the reasons behind these shortcomings, we sent a detailed questionnaire to the developers of many of the major Grid workflow systems. This paper presents the results of the questionnaire evaluation, outlines future directions, and helps guide research towards the open issues identified in the adoption of fault tolerance techniques.
