A Taxonomy and Survey of Fault-Tolerant Workflow Management Systems in Cloud and Distributed Computing Environments

In recent years, workflows have emerged as an important abstraction for collaborative research and for managing complex, large-scale distributed data analytics. Workflows are increasingly prevalent in distributed environments such as clusters, grids, and clouds. These environments provide complex infrastructures that help workflows scale and execute their components in parallel. However, they are prone to performance variations and different types of failures. Workflow management systems therefore need to be robust against performance variations and tolerant of failures. Numerous studies have investigated the fault-tolerance aspects of workflow management systems in different distributed systems. In this study, we analyze these efforts and provide an in-depth taxonomy of them. We present an ontology of faults and fault-tolerance techniques, then position existing workflow management systems with respect to the taxonomies and techniques. In addition, we classify various failure models, metrics, tools, and support systems. Finally, we identify and discuss the strengths and weaknesses of current techniques and provide recommendations on future directions and open areas for the research community.


[111]  Anthony A. Maciejewski,et al.  Robust Resource Allocation in Heterogeneous Parallel and Distributed Computing Systems , 2008, Wiley Encyclopedia of Computer Science and Engineering.

[112]  R. Prodan,et al.  Meeting Soft Deadlines in Scientific Workflows Using Resubmission Impact , 2012, IEEE Transactions on Parallel and Distributed Systems.