A Novel Workflow-Level Data Placement Strategy for Data-Sharing Scientific Cloud Workflows

Cloud computing can provide a more cost-effective way to deploy scientific workflows than traditional distributed computing environments such as clusters and grids. Because scientific datasets are large, data placement plays an important role in scientific cloud workflow systems, both for improving system performance and for reducing data transfer cost. Traditional task-level data placement strategies consider only the datasets shared within an individual workflow when reducing data transfer cost. However, a task-level strategy is not necessarily adequate when multiple workflows share datasets, since datasets shared across workflows fall outside its scope. In this paper, a novel workflow-level data placement model is constructed that treats multiple workflows as a whole. A two-stage data placement strategy is then proposed: it first pre-allocates initial datasets to proper datacenters during the workflow build-time stage, and then dynamically distributes newly generated datasets to appropriate datacenters during the runtime stage. Both stages use an efficient discrete particle swarm optimization algorithm to place flexible-location datasets. Comprehensive experiments demonstrate that our workflow-level data placement strategy can be more cost-effective than its task-level counterpart for data-sharing scientific cloud workflows.
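To make the idea of placing flexible-location datasets with a discrete particle swarm more concrete, the following is a minimal sketch and not the algorithm of this paper. It rests on several assumptions: the cost model (each task runs at the datacenter already holding most of its input data, and every other input is transferred once), the per-dataset update rule (randomize, copy from the personal best, copy from the global best, or keep), and all names and parameter values (`transfer_cost`, `discrete_pso`, `w`, `c1`, `c2`) are hypothetical. Datacenter capacity limits are ignored. Tasks from several workflows are pooled into one list, which is how the sketch treats multiple workflows as a whole.

```python
import random

def transfer_cost(placement, sizes, tasks):
    """Assumed cost model: each task runs at the datacenter that already
    holds the largest share of its input data; the rest is transferred."""
    total = 0.0
    for inputs in tasks:
        held = {}  # datacenter -> total size of this task's inputs stored there
        for d in inputs:
            held[placement[d]] = held.get(placement[d], 0.0) + sizes[d]
        total += sum(sizes[d] for d in inputs) - max(held.values())
    return total

def discrete_pso(sizes, tasks, n_centers, fixed, n_particles=20, iters=100,
                 w=0.1, c1=0.4, c2=0.4, seed=0):
    """Search for a low-cost placement; `fixed` maps fixed-location
    datasets to their datacenter, all other datasets are flexible."""
    rng = random.Random(seed)
    n = len(sizes)

    def new_particle():
        p = [rng.randrange(n_centers) for _ in range(n)]
        for d, dc in fixed.items():  # fixed-location datasets never move
            p[d] = dc
        return p

    swarm = [new_particle() for _ in range(n_particles)]
    pbest = [p[:] for p in swarm]
    pcost = [transfer_cost(p, sizes, tasks) for p in swarm]
    g = min(range(n_particles), key=pcost.__getitem__)
    gbest, gcost = pbest[g][:], pcost[g]

    for _ in range(iters):
        for i, p in enumerate(swarm):
            for d in range(n):
                if d in fixed:
                    continue
                r = rng.random()
                if r < w:                  # random exploration
                    p[d] = rng.randrange(n_centers)
                elif r < w + c1:           # follow the particle's own best
                    p[d] = pbest[i][d]
                elif r < w + c1 + c2:      # follow the swarm's best
                    p[d] = gbest[d]
                # otherwise keep the current datacenter
            cost = transfer_cost(p, sizes, tasks)
            if cost < pcost[i]:
                pbest[i], pcost[i] = p[:], cost
                if cost < gcost:
                    gbest, gcost = p[:], cost
    return gbest, gcost

# Hypothetical instance: five datasets, three tasks drawn from two workflows,
# dataset 0 pinned to datacenter 0 and dataset 4 pinned to datacenter 1.
sizes = [4, 2, 6, 3, 5]
tasks = [[0, 1], [1, 2], [2, 3, 4]]
fixed = {0: 0, 4: 1}
best, cost = discrete_pso(sizes, tasks, n_centers=2, fixed=fixed)
```

Under the same assumptions, a runtime stage could reuse this routine by adding every already-placed dataset to `fixed` and letting only the newly generated datasets remain flexible.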
