Provenance-based fault tolerance technique recommendation for cloud-based scientific workflows: a practical approach

Scientific workflows are abstractions composed of activities, data and dependencies that model a computer simulation and are managed by complex engines named scientific workflow management systems (SWfMSs). Many workflows demand substantial computational resources, since their executions may involve a number of different programs processing a massive volume of data. Thus, the use of high-performance computing (HPC) and data-intensive scalable computing environments, allied to parallelization techniques, provides the necessary support for the execution of such workflows. Clouds are environments that already offer HPC capabilities, and workflows can exploit them. Although clouds offer advantages such as elasticity and availability, failures are a reality rather than a possibility in this environment. Thus, SWfMSs must be fault-tolerant. There are several types of fault tolerance techniques used in SWfMSs, such as Checkpoint/Restart, Re-Execution and Over-provisioning, but it is far from trivial to choose a suitable fault tolerance technique for a workflow execution that does not jeopardize the parallel execution. The major problem is that the suitable fault tolerance technique may be different for each workflow, activity or activation, since the programs associated with activities may present different behaviors. This article analyzes several fault tolerance techniques in a cloud-based SWfMS named SciCumulus and recommends the suitable one for the user's workflow activities and activations using machine learning techniques and provenance data, thus aiming at improving resiliency.
