Data Placement and Task Scheduling Optimization for Data Intensive Scientific Workflow in Multiple Data Centers Environment

Running data-intensive scientific workflows across multiple data centers requires massive data transfers, which lowers the efficiency of real workflow applications for scientists. Taking data size and data dependency into account, we propose a k-means-based initial data placement strategy that places the most closely related initial data sets in the same data center during the workflow preparation stage. During workflow execution, we analyze the interdependencies between data sets and tasks and adopt a multilevel task replication strategy to reduce the volume of intermediate data transferred. Simulation results show that the proposed strategies effectively reduce data transfer among data centers and improve the performance of data-intensive scientific workflows.
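The abstract gives no algorithmic detail, so the following is only a minimal sketch of what a k-means-style initial placement could look like, not the authors' method. It assumes that each data set is described by the set of tasks that read it, that two data sets are "related" when they share tasks, and that each cluster maps to one data center. The names `place_datasets` and `usage` are hypothetical, and data size and storage capacity, which the paper does consider, are ignored here.

```python
from typing import Dict, List, Set


def sq_dist(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def place_datasets(usage: Dict[str, Set[str]], k: int,
                   iterations: int = 20) -> Dict[str, int]:
    """Cluster data sets into k groups (hypothetical data centers).

    usage maps a data set name to the set of tasks that read it.
    Assumption: data sets used by the same tasks should be co-located.
    """
    names = sorted(usage)
    all_tasks = sorted(set().union(*usage.values()))
    # Encode each data set as a 0/1 vector over all tasks.
    vectors = {
        n: [1.0 if t in usage[n] else 0.0 for t in all_tasks] for n in names
    }

    # Deterministic seeding: start from the first data set, then repeatedly
    # pick the data set farthest from the centroids chosen so far.
    centroids = [vectors[names[0]]]
    while len(centroids) < k:
        far = max(names,
                  key=lambda n: min(sq_dist(vectors[n], c) for c in centroids))
        centroids.append(vectors[far])

    assignment: Dict[str, int] = {}
    for _ in range(iterations):
        # Assignment step: each data set goes to its nearest centroid.
        new_assignment = {
            n: min(range(k), key=lambda c: sq_dist(vectors[n], centroids[c]))
            for n in names
        }
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [vectors[n] for n in names if assignment[n] == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return assignment


# Toy workflow: d1 and d2 feed the same tasks; d3 and d4 share task t3.
usage = {
    "d1": {"t1", "t2"},
    "d2": {"t1", "t2"},
    "d3": {"t3"},
    "d4": {"t3", "t4"},
}
placement = place_datasets(usage, k=2)
```

In this toy input, the data sets that share tasks end up in the same cluster, so a task reading both would not need a cross-center transfer. A fuller version would presumably weight the clustering by data size and enforce per-center storage limits.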
