Paper information - A data placement strategy in scientific cloud workflows

In scientific cloud workflows, large amounts of application data need to be stored in distributed data centres. To store these data effectively, a data manager must intelligently select the data centres in which they will reside. The exception is data that must have a fixed location. When one task needs several datasets located in different data centres, the movement of large volumes of data becomes a challenge. In this paper, we propose a matrix-based k-means clustering strategy for data placement in scientific cloud workflows. The strategy contains two algorithms that group the existing datasets into k data centres during the workflow build-time stage and dynamically cluster newly generated datasets to the most appropriate data centres, based on dependencies, during the runtime stage. Simulations show that our algorithm can effectively reduce data movement during the workflow's execution.
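The runtime idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes that the dependency between two datasets is the number of tasks that use both, and it replaces the clustering step with a simple "highest total dependency wins" rule. The names `dependency_matrix`, `place_new_dataset`, `tasks_using`, and `placement` are hypothetical.

```python
def dependency_matrix(tasks_using):
    """Build a symmetric dataset-by-dataset matrix (as nested dicts).

    tasks_using maps a dataset name to the set of task ids that read it.
    ASSUMPTION: dependency = number of tasks shared by the two datasets.
    """
    names = sorted(tasks_using)
    return {
        a: {b: len(tasks_using[a] & tasks_using[b]) for b in names}
        for a in names
    }


def place_new_dataset(new_tasks, tasks_using, placement, k):
    """Pick a data centre (0..k-1) for a newly generated dataset.

    placement maps each existing dataset to its data centre index.
    The new dataset goes to the centre whose datasets share the most
    tasks with it; ties go to the lowest centre index.
    """
    score = [0] * k
    for name, centre in placement.items():
        score[centre] += len(new_tasks & tasks_using[name])
    return max(range(k), key=lambda c: score[c])


# Hypothetical example: three datasets spread over two data centres.
tasks_using = {"d1": {"t1", "t2"}, "d2": {"t2", "t3"}, "d3": {"t4"}}
placement = {"d1": 0, "d2": 0, "d3": 1}
dep = dependency_matrix(tasks_using)
chosen = place_new_dataset({"t4"}, tasks_using, placement, k=2)
```

In this example the new dataset is used only by task `t4`, which also reads `d3`, so it is placed alongside `d3`. Placing it there means a task that needs both datasets does not have to move either one between centres.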
