A hybrid evolutionary algorithm for task scheduling and data assignment of data-intensive scientific workflows on clouds

A growing number of data- and compute-intensive experiments have been modeled as scientific workflows in the last decade. Meanwhile, clouds have emerged as a prominent environment for executing this type of workflow. In this scenario, the investigation of workflow scheduling strategies that aim to reduce execution time has become a top priority and a very popular research field. However, few works consider the problem of data file assignment when solving the task scheduling problem. Usually, a workflow is represented by a graph whose nodes represent tasks, and the scheduling problem consists of allocating tasks to machines, to be executed at a predefined time, with the aim of reducing the makespan of the whole workflow. In this article, we show that the scheduling of scientific workflows can be improved when the task scheduling and data file assignment problems are treated together. Thus, we propose a new workflow representation, in which nodes of the workflow graph represent either tasks or data files, and define the Task Scheduling and Data Assignment Problem (TaSDAP) on this new model. We formulate this problem as an integer programming problem. Moreover, we introduce a hybrid evolutionary algorithm for solving it, named HEA-TaSDAP. To evaluate our approach, we conducted two types of experiments: theoretical and practical. First, we compared HEA-TaSDAP with the solutions produced by the mathematical formulation and by other works from the related literature. Then, we considered real executions in the Amazon EC2 cloud using a real scientific workflow use case (SciPhy, for phylogenetic analyses). In all experiments, HEA-TaSDAP outperformed classical approaches from the related literature, such as MinMin and HEFT.
A new workflow model that considers tasks and data.
The mathematical formulation of the Task Scheduling and Data Assignment Problem.
The design of a Hybrid Evolutionary Algorithm (HEA) for scheduling tasks and data.
An extensive experimental evaluation, based on synthetic and real executions.
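
To make the idea of treating tasks and data files together more concrete, the following is a minimal Python sketch of a workflow whose nodes are either tasks or data files, together with a simple makespan estimate for a given placement. It is not the integer programming formulation nor HEA-TaSDAP: the `Workflow` class, the `makespan` function, the cost model (a remote file costs size divided by bandwidth, tasks on the same machine run one after another) and all numbers are assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    """Hypothetical workflow model: tasks and data files are both nodes."""
    task_time: dict   # task name -> execution time (illustrative units)
    file_size: dict   # data file name -> size (illustrative units)
    reads: dict       # task name -> list of files the task reads
    writes: dict      # task name -> list of files the task writes


def makespan(wf, task_vm, file_vm, order, bandwidth=1.0):
    """Estimate the makespan of a placement under an assumed cost model.

    task_vm / file_vm map each task / data file to a machine.
    order must be a topological order of the tasks.
    """
    produced = {f for files in wf.writes.values() for f in files}
    # Input files (written by no task) are available from time 0.
    file_ready = {f: 0.0 for f in wf.file_size if f not in produced}
    vm_free = {}      # machine -> time at which it becomes idle
    finish = 0.0
    for t in order:
        vm = task_vm[t]
        ins, outs = wf.reads.get(t, []), wf.writes.get(t, [])
        # A task starts when its machine is idle and its inputs exist.
        start = max([vm_free.get(vm, 0.0)] + [file_ready[f] for f in ins])
        # Files stored on another machine add a transfer cost.
        transfer = sum(wf.file_size[f] / bandwidth
                       for f in ins + outs if file_vm[f] != vm)
        end = start + wf.task_time[t] + transfer
        for f in outs:
            file_ready[f] = end
        vm_free[vm] = end
        finish = max(finish, end)
    return finish


# Toy instance: t1 reads d0 and writes d1; t2 reads d1 and writes d2.
wf = Workflow(
    task_time={"t1": 3, "t2": 5},
    file_size={"d0": 4, "d1": 2, "d2": 1},
    reads={"t1": ["d0"], "t2": ["d1"]},
    writes={"t1": ["d1"], "t2": ["d2"]},
)
tasks_on_vm0 = {"t1": "vm0", "t2": "vm0"}
order = ["t1", "t2"]

# Same task schedule, two different data assignments.
colocated = makespan(wf, tasks_on_vm0,
                     {"d0": "vm0", "d1": "vm0", "d2": "vm0"}, order)
split = makespan(wf, tasks_on_vm0,
                 {"d0": "vm0", "d1": "vm1", "d2": "vm0"}, order)
print(colocated, split)
```

In this toy instance the task schedule is identical in both calls, yet the estimated makespan is 8.0 when every file is stored next to the tasks and 12.0 when the intermediate file `d1` is placed on another machine. That gap is the kind of effect that motivates deciding task scheduling and data assignment jointly.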
