A Fault-Tolerance Architecture for Kepler-Based Distributed Scientific Workflows

Fault tolerance and failure recovery in scientific workflows are still relatively young topics. Work in this domain has so far mostly applied classic fault-tolerance mechanisms, such as "alternative versions" and "checkpointing", to scientific workflows. Scientific workflow systems often simply rely on the fault-tolerance capabilities of their third-party subcomponents, such as schedulers, Grid resources, or the underlying operating systems. When failures occur at these underlying layers, a workflow system typically sees them only as failed steps in the process, without additional detail, and its ability to recover from them may be limited. In this paper, we present an architecture that aims to address this problem for Kepler-based scientific workflows by providing more information about the failures and faults we have observed, together with a supporting implementation that offers more comprehensive failure coverage and recovery options. We discuss our framework in the context of the failures observed in two production-level Kepler-based workflows, XGC and S3D. The framework is divided into three major components: (i) a general contingency Kepler actor that provides recovery-block functionality at the workflow level, (ii) an external monitoring module that tracks the underlying workflow components and monitors the overall health of the workflow execution, and (iii) a checkpointing mechanism that provides smart resume capabilities for cases in which an unrecoverable error occurs. The framework uses the provenance data collected by the Kepler-based workflows to detect failures and to support fault-tolerance decision making.
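
To illustrate the recovery-block idea behind component (i), the following is a minimal sketch in Python, not the Kepler actor itself. It assumes that each alternative is an ordinary callable and that an acceptance test decides whether a result is usable; the names `recovery_block`, `AllAlternatesFailed`, and the example functions are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

In = TypeVar("In")
Out = TypeVar("Out")


class AllAlternatesFailed(Exception):
    """Raised when no alternate produces a result that passes the acceptance test."""


def recovery_block(
    alternates: Sequence[Callable[[In], Out]],
    acceptance_test: Callable[[Out], bool],
    inputs: In,
) -> Tuple[Out, List[str]]:
    """Try each alternate in order and return the first accepted result.

    Also returns a list describing why earlier alternates were skipped,
    so the caller sees more than a bare "step failed".
    """
    failures: List[str] = []
    for index, alternate in enumerate(alternates):
        try:
            result = alternate(inputs)
        except Exception as exc:  # a crash counts as a failed alternate
            failures.append(f"alternate {index}: raised {exc!r}")
            continue
        if acceptance_test(result):
            return result, failures
        failures.append(f"alternate {index}: rejected by acceptance test")
    raise AllAlternatesFailed(failures)


# Hypothetical example: a primary step that fails and a fallback that succeeds.
def primary_step(value: int) -> int:
    raise RuntimeError("remote resource unavailable")


def fallback_step(value: int) -> int:
    return value * 2


if __name__ == "__main__":
    result, failures = recovery_block(
        [primary_step, fallback_step], lambda r: r > 0, 21
    )
    print(result)
    print(failures)
```

The list of recorded failures stands in, very loosely, for the additional failure detail that the architecture aims to expose; in the framework described here that information comes from provenance data rather than from an in-memory list.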
