A Framework for Adaptive Fault-Tolerant Execution of Workflows in the Grid: Empirical and Theoretical Analysis

In this paper, we propose and evaluate a framework for fault-tolerant workflow execution in Grid environments. Unlike previous work in the literature, our system dynamically chooses an appropriate fault tolerance technique using a user-defined rule-based system. We also provide a generic interface that can be used to add fault tolerance techniques to the framework. Results obtained with real workflows in an experimental Grid environment show that the overhead introduced by our framework in a failure-free execution is approximately 10% in the worst evaluated case. Moreover, we show that, with our framework, workflows execute successfully in the presence of failures and that the framework can dynamically choose an appropriate fault tolerance technique. The main contributions of our work are twofold: the developed framework and the model-based dependability analysis we performed on it. The purpose of the model-based dependability analysis is to evaluate the interaction between our framework and the distributed Grid environment beyond the physical limitations of an empirical evaluation. In doing so, we provide a means to plan QoS assurance in Grid resource allocation while applying the fault-tolerance mechanisms implemented in our framework, regardless of the underlying middleware.
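To make the idea of rule-based technique selection behind a generic interface more concrete, the following is a minimal sketch in Python. It is not the framework's actual API: the names `FaultToleranceTechnique`, `Retry`, `Replication`, `Rule`, and `choose_technique`, the `failure_rate` context key, and the resource names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Protocol, Sequence


class FaultToleranceTechnique(Protocol):
    """Hypothetical plug-in interface; an assumption, not the paper's API."""
    name: str

    def run(self, task: Callable[[str], str]) -> str: ...


@dataclass
class Retry:
    """Re-run the task on the same resource up to max_attempts times."""
    resource: str = "node-a"
    max_attempts: int = 3
    name: str = "retry"

    def run(self, task: Callable[[str], str]) -> str:
        last_error = None
        for _ in range(self.max_attempts):
            try:
                return task(self.resource)
            except RuntimeError as exc:
                last_error = exc
        raise RuntimeError("all attempts failed") from last_error


@dataclass
class Replication:
    """Try the task on each resource in turn; return the first success."""
    resources: Sequence[str]
    name: str = "replication"

    def run(self, task: Callable[[str], str]) -> str:
        last_error = None
        for resource in self.resources:
            try:
                return task(resource)
            except RuntimeError as exc:
                last_error = exc
        raise RuntimeError("all replicas failed") from last_error


@dataclass
class Rule:
    """A user-defined rule: if condition(context) holds, use the technique."""
    condition: Callable[[dict], bool]
    technique: FaultToleranceTechnique


def choose_technique(rules, context, default):
    """Return the technique of the first matching rule, else the default."""
    for rule in rules:
        if rule.condition(context):
            return rule.technique
    return default
```

In this sketch, a new technique is added by writing another class with a `name` attribute and a `run` method, and a rule such as `Rule(lambda ctx: ctx["failure_rate"] > 0.2, Replication(["node-a", "node-b"]))` selects replication only when the observed failure rate is high. The replicas are tried sequentially here for simplicity; a real Grid deployment would more likely dispatch them in parallel.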
