A data placement strategy in scientific cloud workflows

In scientific cloud workflows, large amounts of application data need to be stored in distributed data centres. To store these data effectively, a data manager must intelligently select the data centres in which they will reside. Some data, however, must stay at a fixed location and cannot be placed freely. When one task needs several datasets located in different data centres, the movement of large volumes of data becomes a challenge. In this paper, we propose a matrix-based k-means clustering strategy for data placement in scientific cloud workflows. The strategy contains two algorithms: one groups the existing datasets into k data centres during the workflow build-time stage, and the other dynamically clusters newly generated datasets to the most appropriate data centres, based on their dependencies, during the runtime stage. Simulations show that our strategy can effectively reduce data movement during the workflow's execution.
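To make the idea of dependency-based placement concrete, the following is a minimal sketch and not the paper's algorithms. It assumes that the dependency between two datasets is the number of tasks that use both, and that a newly generated dataset goes to the data centre whose resident datasets it depends on most. The build-time partition is taken as given rather than computed by clustering, and all names (`dependency_matrix`, `best_centre`, `dc1`, `d1`, and so on) are hypothetical.

```python
from itertools import combinations


def dependency_matrix(tasks):
    """Return dep[a][b] = number of tasks that use both dataset a and dataset b.

    Assumption for illustration: dependency is a simple co-usage count.
    """
    datasets = sorted(set().union(*tasks.values()))
    dep = {a: {b: 0 for b in datasets} for a in datasets}
    for used in tasks.values():
        for a in used:
            dep[a][a] += 1  # diagonal: how many tasks use the dataset at all
        for a, b in combinations(sorted(used), 2):
            dep[a][b] += 1
            dep[b][a] += 1
    return dep


def best_centre(dep, new_dataset, placement):
    """Pick the centre whose resident datasets share the most tasks with new_dataset."""
    def score(centre):
        return sum(dep[new_dataset][d] for d in placement[centre])
    # Sorting the centre names first makes tie-breaking deterministic.
    return max(sorted(placement), key=score)


# Hypothetical workflow: each task lists the datasets it reads or writes.
tasks = {
    "t1": {"d1", "d2"},
    "t2": {"d1", "d2", "d3"},
    "t3": {"d4", "d5"},
    "t4": {"d3", "d6"},
    "t5": {"d5", "d6"},
    "t6": {"d4", "d6"},
}
dep = dependency_matrix(tasks)

# Assumed build-time result: existing datasets already grouped into k = 2 centres.
placement = {"dc1": {"d1", "d2", "d3"}, "dc2": {"d4", "d5"}}

# Runtime step: d6 is newly generated and must be assigned to a centre.
chosen = best_centre(dep, "d6", placement)
placement[chosen].add("d6")
```

Here `d6` shares one task with the datasets in `dc1` and two with those in `dc2`, so it is placed in `dc2`, where fewer of its tasks would need to pull data from another centre.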
