Dynamic and Fault-Tolerant Clustering for Scientific Workflows

Task clustering has proven to be an effective method to reduce execution overhead and to improve the computational granularity of scientific workflow tasks executing on distributed resources. However, a job composed of multiple tasks may have a higher risk of suffering from failures than a single-task job. In this paper, we conduct a theoretical analysis of the impact of transient failures on the runtime performance of scientific workflow executions. We propose a general task failure modeling framework that models workflow performance using a parameter estimation process based on maximum likelihood estimation. We further propose three fault-tolerant clustering strategies to improve the runtime performance of workflow executions in faulty execution environments. Experimental results show that failures can have a significant impact on executions where task clustering policies are not fault-tolerant, and that our solutions yield makespan improvements in such scenarios. In addition, we propose a dynamic task clustering strategy that optimizes the workflow's makespan by dynamically adjusting the clustering granularity when failures arise. A trace-based simulation of five real workflows shows that our dynamic method is able to adapt to unexpected behaviors and yields better makespans than static methods.
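
The intuition that a clustered job is more exposed to transient failures than a single-task job can be illustrated with a minimal sketch. It rests on simplifying assumptions that are not taken from the paper: failure inter-arrival times are treated as exponentially distributed, the tasks of a clustered job run sequentially, and the function names (mle_failure_rate, job_failure_probability) are hypothetical. Under these assumptions, the maximum likelihood estimate of the failure rate is the number of observations divided by their sum.

```python
import math


def mle_failure_rate(inter_arrival_times):
    """Maximum likelihood estimate of the rate of an exponential distribution.

    Assumption (not from the paper): failure inter-arrival times are
    exponentially distributed, so the MLE is n / sum(observations).
    """
    if not inter_arrival_times:
        raise ValueError("at least one observation is required")
    if any(t <= 0 for t in inter_arrival_times):
        raise ValueError("inter-arrival times must be positive")
    return len(inter_arrival_times) / sum(inter_arrival_times)


def job_failure_probability(task_runtimes, rate):
    """Probability that at least one failure occurs while the job runs.

    Assumption: the tasks of a clustered job execute sequentially, so the
    exposure time of the job is the sum of its task runtimes.
    """
    exposure = sum(task_runtimes)
    return 1.0 - math.exp(-rate * exposure)


# Hypothetical observations of time between failures, in seconds.
rate = mle_failure_rate([100.0, 200.0, 300.0])

# One 10-second task versus a clustered job of four such tasks.
p_single = job_failure_probability([10.0], rate)
p_clustered = job_failure_probability([10.0] * 4, rate)
print(f"estimated rate: {rate:.4f} failures/s")
print(f"single task: {p_single:.4f}, clustered job: {p_clustered:.4f}")
```

In this simplified model, the failure probability grows with the number of tasks placed in one job, which is the trade-off a fault-tolerant clustering strategy has to balance against the overhead saved by clustering.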
