Towards optimized scheduling for data‐intensive scientific workflow in multiple datacenter environment

In the big data era, scientific workflows have become increasingly data intensive and increasingly popular across scientific domains. Efficient scheduling of data‐intensive scientific workflows in a multiple datacenter (DC) environment is a long‐standing challenge. Most previous work on scheduling such workflows has focused primarily on reducing the volume of data transferred between workflow tasks. In this paper, novel scheduling strategies for executing data‐intensive scientific workflows in a multi‐DC environment are proposed, with the aim of minimizing the overall data transfer time. A novel DC selection approach is proposed to minimize the number of DCs used, subject to their having enough storage capacity to execute the workflow, while also optimizing the inter‐DC network bandwidth available for data transfer between workflow tasks. A k‐means clustering‐based data placement strategy is adopted to place the initial data of the workflow so that the volume of initial data transferred between DCs is reduced. A multilevel task replication scheduling strategy is devised to reduce the volume of intermediate data transferred between DCs at runtime. Simulations spanning a broad range of scientific workflows and multi‐DC settings are performed to verify the proposed approaches. The numerical results show that the combined scheduling strategy significantly reduces both the overall data transfer time and the data transfer volume when scientific workflows are scheduled in a multi‐DC environment. Copyright © 2015 John Wiley & Sons, Ltd.
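
The abstract does not spell out how the k‐means clustering‐based data placement works, so the following is only a minimal sketch of the general idea and not the authors' algorithm. It assumes each initial dataset is described by a binary vector marking which tasks read it, clusters those vectors with plain Lloyd's k‐means, and maps each cluster to one DC so that datasets used by the same tasks end up together. The names `kmeans`, `place_datasets`, `usage`, and the DC labels are hypothetical, and storage capacity and bandwidth are ignored here.

```python
from typing import Dict, List


def kmeans(points: List[List[float]], k: int, iters: int = 20) -> List[int]:
    """Plain Lloyd's k-means; the first k points seed the centroids (an assumption)."""
    centroids = [p[:] for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def place_datasets(usage: Dict[str, List[int]], dcs: List[str]) -> Dict[str, str]:
    """Cluster datasets by task usage and map cluster i to dcs[i]."""
    names = list(usage)
    points = [[float(x) for x in usage[n]] for n in names]
    labels = kmeans(points, len(dcs))
    return {name: dcs[lab] for name, lab in zip(names, labels)}


# Hypothetical usage matrix: rows are datasets, columns are tasks t1..t4.
usage = {
    "d1": [1, 1, 0, 0],
    "d2": [0, 0, 1, 1],
    "d3": [1, 1, 1, 0],
    "d4": [0, 0, 0, 1],
}
placement = place_datasets(usage, ["DC-A", "DC-B"])
print(placement)
```

In this toy input, datasets that share most of their consuming tasks fall into the same cluster and therefore the same DC, which is the intuition behind reducing initial inter‐DC transfers. A fuller treatment would also have to respect per‐DC storage limits.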
