A Data-Aware Scheduling Strategy for Executing Large-Scale Distributed Workflows

Task scheduling is a key component for the efficient execution of data-intensive applications in distributed environments, where many machines must be coordinated to reduce execution time and bandwidth consumption. This paper presents ADAGE, a data-aware scheduler designed to efficiently execute data-intensive workflows on large-scale computers. The proposed scheduler is based on three key features: (i) critical path analysis, which discovers the critical tasks of a workflow and reduces data transfers between nodes; (ii) work giving, a new dynamic planning strategy that migrates tasks from overloaded to unloaded nodes; and (iii) task replication, which executes task replicas on different nodes to improve both execution time and fault tolerance. Experiments performed on a distributed computing environment composed of up to 1,024 processing nodes show that ADAGE achieves better performance than existing scheduling systems, obtaining an average reduction of up to 66% in execution time.
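
To make the first feature concrete, the following is a minimal sketch of generic critical path analysis on a workflow graph. It is not ADAGE's implementation: the function name `critical_path`, the task names, and the per-task durations are hypothetical, and the sketch ignores data-transfer costs, which a data-aware scheduler would also weigh.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def critical_path(durations, deps):
    """Return (length, path) of the longest duration-weighted path in a DAG.

    durations: task name -> estimated run time (hypothetical units)
    deps:      task name -> list of tasks that must finish first
    """
    graph = {task: deps.get(task, ()) for task in durations}
    finish = {}  # earliest finish time of each task
    prev = {}    # predecessor that determines each task's start time
    # static_order() yields every task after all of its dependencies.
    for task in TopologicalSorter(graph).static_order():
        latest = max(graph[task], key=lambda d: finish[d], default=None)
        start = finish[latest] if latest is not None else 0
        finish[task] = start + durations[task]
        prev[task] = latest
    # Walk back from the task that finishes last to recover the path.
    node = max(finish, key=finish.get)
    length = finish[node]
    path = []
    while node is not None:
        path.append(node)
        node = prev[node]
    path.reverse()
    return length, path


# Illustrative workflow (assumed values, not taken from the paper).
durations = {"extract": 3, "clean": 2, "stats": 4, "train": 6, "report": 1}
deps = {
    "clean": ["extract"],
    "stats": ["clean"],
    "train": ["clean"],
    "report": ["stats", "train"],
}
length, path = critical_path(durations, deps)
```

In this example the critical path is `extract`, `clean`, `train`, `report`, with a total length of 12. Tasks on that path cannot be delayed without delaying the whole workflow, so a scheduler might prefer to keep them on the same node to avoid moving intermediate data.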
