Clustering Based on Task Dependency for Data-Intensive Workflow Scheduling Optimization

Scientists in experiment teams share their data and use distributed resources to conduct their experiments. These experiments are carried out in collaboration with teams that are dispersed around the globe, so scientific data need to be replicated or cached at distributed locations worldwide. Data locality and data transfer overhead are therefore important challenges when scheduling such data-intensive scientific workflow applications in cloud computing. These applications are leading into the era of big data: task execution consumes and produces huge amounts of input and output data, with data dependencies among tasks. On widely distributed platforms, fine-grained tasks commonly perform poorly because their scheduling and execution overhead is high. This paper presents a Clustering Method based on Task Dependency (CMTD) that reduces execution overhead and improves the computational granularity of scientific workflow tasks. It also proposes a data-intensive workflow scheduling system that minimizes the makespan of data-intensive workflow applications, which can be modeled as directed acyclic graphs. The clustering method is validated by simulation-based analysis using WorkflowSim.
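To illustrate the general idea of dependency-based clustering on a workflow DAG, the following is a minimal Python sketch. It is not the CMTD algorithm, whose details are not given in this abstract; it assumes a simple rule in which a task is merged with its child whenever the two are linked one-to-one (the task has a single child and the child has a single parent). The function name `cluster_chains` and the example workflow are hypothetical.

```python
from collections import defaultdict


def cluster_chains(children):
    """Group tasks of a DAG into jobs by merging one-to-one parent/child links.

    `children` maps each task to the list of tasks that depend on it.
    This is an illustrative rule, not the paper's CMTD method.
    """
    # Collect every task and build the reverse (parent) mapping.
    tasks = set(children)
    for kids in children.values():
        tasks.update(kids)
    parents = defaultdict(list)
    for task, kids in children.items():
        for kid in kids:
            parents[kid].append(task)

    clusters = []
    for task in sorted(tasks):
        ps = parents[task]
        # A task that can be merged into its only parent does not start a job.
        if len(ps) == 1 and len(children.get(ps[0], [])) == 1:
            continue
        # Otherwise it heads a chain; follow one-to-one links downward.
        chain = [task]
        current = task
        while True:
            kids = children.get(current, [])
            if len(kids) == 1 and len(parents[kids[0]]) == 1:
                current = kids[0]
                chain.append(current)
            else:
                break
        clusters.append(chain)
    return clusters


# Hypothetical six-task workflow: a fans out to two chains that join at f.
workflow = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["e"],
    "d": ["f"],
    "e": ["f"],
    "f": [],
}
jobs = cluster_chains(workflow)
```

In this example the six fine-grained tasks are grouped into four jobs, so the scheduler dispatches fewer units and the data passed between `b` and `d`, and between `c` and `e`, stays inside a single job.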
