An Efficient Fault Tolerant Workflow Scheduling Approach using Replication Heuristics and Checkpointing in the Cloud

Abstract Scientific workflows have been predominantly used for complex and large scale data analysis and scientific computation/automation and the need for robust workflow scheduling techniques has grown considerably. But, most of the existing workflow scheduling algorithms do not provide the required reliability and robustness. In this paper, a new fault tolerant workflow scheduling algorithm that learns replication heuristics in an unsupervised manner has been proposed. Furthermore, the use of light weight synchronized checkpointing enables efficient resubmission of failed tasks and ensures workflow completion even in precarious environments. The proposed technique improves upon metrics like Resource Wastage and Resource Usage in comparison to the Replicate-All algorithm, while maintaining an acceptable increase in Makespan as compared to the vanilla Heterogeneous Earliest Finish Time (HEFT).

[1]  Salim Hariri,et al.  Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing , 2002, IEEE Trans. Parallel Distributed Syst..

[2]  Radu Prodan,et al.  Scheduling of scientific workflows in the ASKALON grid environment , 2005, SGMD.

[3]  Ishfaq Ahmad,et al.  Benchmarking and Comparison of the Task Graph Scheduling Algorithms , 1999, J. Parallel Distributed Comput..

[4]  Boontee Kruatrachue,et al.  Grain size determination for parallel processing , 1988, IEEE Software.

[5]  Radu Prodan,et al.  A New Fault Tolerance Heuristic for Scientific Workflows in Highly Distributed Environments Based on Resubmission Impact , 2009, 2009 Fifth IEEE International Conference on e-Science.

[6]  Rajkumar Buyya,et al.  A Taxonomy and Survey of Fault-Tolerant Workflow Management Systems in Cloud and Distributed Computing Environments , 2017 .

[7]  Radu Prodan,et al.  DEE: A Distributed Fault Tolerant Workflow Enactment Engine for Grid Computing , 2005, HPCC.

[8]  I. Jolliffe Principal Component Analysis , 2002 .

[9]  Dhruv Batra,et al.  Joint Unsupervised Learning of Deep Representations and Image Clusters , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Bronis R. de Supinski,et al.  Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[11]  Yun Yang,et al.  Robust Scheduling of Scientific Workflows with Deadline and Budget Constraints in Clouds , 2014, 2014 IEEE 28th International Conference on Advanced Information Networking and Applications.

[12]  Y.-K. Kwok,et al.  Static scheduling algorithms for allocating directed task graphs to multiprocessors , 1999, CSUR.

[13]  Ian T. Jolliffe,et al.  Principal Component Analysis , 2002, International Encyclopedia of Statistical Science.

[14]  Ewa Deelman,et al.  Resource management for scientific workflows , 2012 .

[15]  Miron Livny,et al.  Pegasus, a workflow management system for science automation , 2015, Future Gener. Comput. Syst..

[16]  Daniel A. Reed,et al.  Fault Tolerance and Recovery of Scientific Workflows on Computational Grids , 2008, 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID).

[17]  Ewa Deelman,et al.  Fault Tolerant Clustering in Scientific Workflows , 2012, 2012 IEEE Eighth World Congress on Services.

[18]  Franck Cappello,et al.  FTI: High performance Fault Tolerance Interface for hybrid systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[20]  Ewa Deelman,et al.  Grids and Clouds: Making Workflow Applications Work in Heterogeneous Distributed Environments , 2010, Int. J. High Perform. Comput. Appl..

[21]  Rajkumar Buyya,et al.  A Taxonomy of Workflow Management Systems for Grid Computing , 2005, Proceedings of the 38th Annual Hawaii International Conference on System Sciences.

[22]  G. Bruce Berriman,et al.  On the Use of Cloud Computing for Scientific Workflows , 2008, 2008 IEEE Fourth International Conference on eScience.

[23]  Thomas Fahringer,et al.  Fault-tolerant behavior in state-of-the-art grid workflow management systems , 2007 .

[24]  Yang Zhang,et al.  Combined Fault Tolerance and Scheduling Techniques for Workflow Applications on Computational Grids , 2009, 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid.

[25]  Trevor Hastie,et al.  An Introduction to Statistical Learning , 2013, Springer Texts in Statistics.

[26]  Soonwook Hwang,et al.  Grid workflow: a flexible failure handling framework for the grid , 2003, High Performance Distributed Computing, 2003. Proceedings. 12th IEEE International Symposium on.

[27]  James J. Filliben,et al.  NIST/SEMATECH e-Handbook of Statistical Methods; Chapter 1: Exploratory Data Analysis , 2003 .