Grid workflow: a flexible failure handling framework for the grid

The generic, heterogeneous, and dynamic nature of the grid requires a new form of failure recovery mechanism that addresses its unique requirements: support for diverse failure handling strategies, separation of failure handling strategies from application code, and user-defined exception handling. We propose a grid workflow system (grid-WFS), a flexible failure handling framework for the grid that addresses these requirements. The framework derives its flexibility from using the workflow structure as a high-level recovery policy specification. We show how this allows users to achieve failure recovery in a variety of ways, depending on the requirements and constraints of their applications. We also demonstrate that it enables users not only to rapidly prototype and investigate failure handling strategies, but also to change them easily by modifying the encompassing workflow structure while the application code remains intact. Finally, we present an experimental evaluation of the framework based on simulation, demonstrating the value of supporting multiple failure recovery techniques in grid systems to achieve high performance in the presence of failures.
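
To illustrate the idea of keeping recovery policy in the workflow structure rather than in application code, the following is a minimal Python sketch. It is not the grid-WFS interface, which the abstract does not describe; every name in it (`TaskFailure`, `with_retry`, `with_alternative`, `run_workflow`, `make_flaky_task`, `slow_but_reliable`) is hypothetical. Retrying and falling back to an alternative task are used only as two example strategies.

```python
from typing import Callable, List, Sequence

# A task is plain application code: a callable that returns a result string.
Task = Callable[[], str]


class TaskFailure(Exception):
    """Hypothetical signal raised by a task when it fails."""


def with_retry(task: Task, max_tries: int) -> Task:
    """Workflow-level policy: rerun the same task up to max_tries times."""
    def run() -> str:
        last_error = None
        for _ in range(max_tries):
            try:
                return task()
            except TaskFailure as err:
                last_error = err
        raise TaskFailure(f"gave up after {max_tries} tries") from last_error
    return run


def with_alternative(primary: Task, fallback: Task) -> Task:
    """Workflow-level policy: if the primary task fails, run a different task."""
    def run() -> str:
        try:
            return primary()
        except TaskFailure:
            return fallback()
    return run


def run_workflow(steps: Sequence[Task]) -> List[str]:
    """Run the steps in order and collect their results."""
    return [step() for step in steps]


def make_flaky_task(failures_before_success: int) -> Task:
    """Build a simulated task that fails a fixed number of times, then succeeds."""
    remaining = [failures_before_success]

    def task() -> str:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise TaskFailure("simulated resource failure")
        return "fast-result"
    return task


def slow_but_reliable() -> str:
    """Simulated alternative implementation that never fails."""
    return "slow-result"


# Same kind of application task, two different recovery policies,
# chosen purely by how the workflow is composed.
retry_workflow = [with_retry(make_flaky_task(2), max_tries=3)]
fallback_workflow = [with_alternative(make_flaky_task(5), slow_but_reliable)]

retry_results = run_workflow(retry_workflow)
fallback_results = run_workflow(fallback_workflow)
```

In this sketch, switching from retrying to an alternative task changes only the line that composes the workflow; the task functions themselves are untouched, which is the separation the abstract argues for.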
