A Greedy Approach with New Cost Model for Intermediate Datasets Storage Problem in General Workflows

Running a scientific workflow in the cloud generates a large volume of intermediate datasets, many of which contain valuable information that can be used for further study, but storing all of them is prohibitively expensive because of their enormous size. A feasible solution is to store some of the intermediate datasets and re-compute the others when needed. The intermediate dataset storage problem asks for a tradeoff that minimizes the total cost of storing or re-generating each intermediate dataset. This paper focuses on a new cost model for the problem in general workflows, which additionally incorporates delay tolerance, usage frequency, and transfer cost to make the model more general. Based on a directed acyclic graph describing the dependence relationships between datasets, a greedy approach for the problem is proposed and implemented. Experimental results demonstrate the effectiveness and efficiency of our algorithm.
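To make the store-or-regenerate tradeoff concrete, the following is a minimal illustrative sketch and not the algorithm or cost model proposed in the paper. It assumes a simplified cost rate in which a stored dataset pays a storage rate, while an unstored dataset pays its usage frequency times the cost of regenerating it together with every unstored ancestor it depends on. Delay tolerance and transfer cost are omitted. All names (`Dataset`, `regen_cost`, `total_cost_rate`, `greedy_store`) and all numbers are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    gen_cost: float       # cost of computing it once from its direct predecessors
    storage_rate: float   # cost per time unit of keeping it stored
    usage_freq: float     # expected number of uses per time unit
    preds: list = field(default_factory=list)  # names of direct predecessors in the DAG


def regen_cost(name, datasets, stored):
    """Generation cost of `name` plus that of every unstored ancestor
    reachable without passing through a stored dataset."""
    needed, stack = set(), [name]
    while stack:
        current = stack.pop()
        if current in needed:
            continue  # shared ancestors are counted only once
        needed.add(current)
        stack.extend(p for p in datasets[current].preds if p not in stored)
    return sum(datasets[n].gen_cost for n in needed)


def total_cost_rate(datasets, stored):
    """Total cost per time unit under the assumed simplified model."""
    total = 0.0
    for name, d in datasets.items():
        if name in stored:
            total += d.storage_rate
        else:
            total += d.usage_freq * regen_cost(name, datasets, stored)
    return total


def greedy_store(datasets):
    """Repeatedly store the dataset that lowers the total cost rate the most;
    stop when no single addition improves it."""
    stored = set()
    best = total_cost_rate(datasets, stored)
    while True:
        candidates = [(total_cost_rate(datasets, stored | {n}), n)
                      for n in datasets if n not in stored]
        if not candidates:
            break
        cost, name = min(candidates)
        if cost >= best:
            break
        stored.add(name)
        best = cost
    return stored, best


# Hypothetical three-dataset workflow: a -> b, and c depends on both a and b.
example = {
    "a": Dataset(gen_cost=10, storage_rate=1, usage_freq=0.5),
    "b": Dataset(gen_cost=2, storage_rate=5, usage_freq=0.1, preds=["a"]),
    "c": Dataset(gen_cost=20, storage_rate=2, usage_freq=0.2, preds=["a", "b"]),
}

stored, cost = greedy_store(example)
print(sorted(stored), round(cost, 2))
```

In this toy instance the heuristic leaves `b` unstored, since it is cheap to regenerate and rarely used. Such a one-step greedy rule offers no optimality guarantee in general; it only illustrates how a dependency DAG turns storage decisions for one dataset into regeneration costs for its descendants.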
