A Cleanup Algorithm for Implementing Storage Constraints in Scientific Workflow Executions

Scientific workflows are often used to automate large-scale data analysis pipelines on clusters, grids, and clouds. However, because workflows can be extremely data-intensive, and are often executed on shared resources, it is critical to be able to limit or minimize the amount of disk space that workflows use on shared storage systems. This paper proposes a novel and simple approach that constrains the amount of storage space used by a workflow by inserting data cleanup tasks into the workflow task graph. Unlike previous solutions, the proposed approach provides guaranteed limits on disk usage, requires no new functionality in the underlying workflow scheduler, and does not require estimates of task runtimes. Experimental results show that this algorithm significantly reduces the number of cleanup tasks added to a workflow and yields better workflow makespans than the strategy currently used by the Pegasus Workflow Management System.

[1]  Mark Greenwood,et al.  Taverna: lessons in creating a workflow environment for the life sciences: Research Articles , 2006 .

[2]  Hong Linh Truong,et al.  ASKALON: a tool set for cluster and Grid computing: Research Articles , 2005 .

[3]  Daniel S. Katz,et al.  Optimizing workflow data footprint , 2007, Sci. Program..

[4]  Dennis Gannon,et al.  Workflows for e-Science, Scientific Workflows for Grids , 2014 .

[5]  Jeff Weber,et al.  Workflow Management in Condor , 2007, Workflows for e-Science, Scientific Workflows for Grids.

[6]  Ewa Deelman,et al.  Partitioning and Scheduling Workflows across Multiple Sites with Storage Constraints , 2011, PPAM.

[7]  Moustafa Ghanem,et al.  Tavaxy: Integrating Taverna and Galaxy workflows with cloud computing support , 2012, BMC Bioinformatics.

[8]  Rizos Sakellariou,et al.  Scheduling Data-IntensiveWorkflows onto Storage-Constrained Distributed Resources , 2007, Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid '07).

[9]  Ann L. Chervenak,et al.  Scheduling data-intensive workflows on storage constrained resources , 2009, WORKS '09.

[10]  Edward A. Lee,et al.  CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. 2000; 00:1–7 Prepared using cpeauth.cls [Version: 2002/09/19 v2.02] Taverna: Lessons in creating , 2022 .

[11]  Daniel S. Katz,et al.  Montage: a grid-enabled engine for delivering custom science-grade mosaics on demand , 2004, SPIE Astronomical Telescopes + Instrumentation.

[12]  Radu Prodan,et al.  ASKALON: a tool set for cluster and Grid computing , 2005, Concurr. Pract. Exp..

[13]  C. Kesselman,et al.  CyberShake: A Physics-Based Seismic Hazard Model for Southern California , 2011 .

[14]  Rajkumar Buyya,et al.  CloudSim: a toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms , 2011, Softw. Pract. Exp..

[15]  Miron Livny,et al.  Pegasus, a workflow management system for science automation , 2015, Future Gener. Comput. Syst..

[16]  Douglas Thain,et al.  Toward fine-grained online task characteristics estimation in scientific workflows , 2013, WORKS@SC.

[17]  Ewa Deelman,et al.  Workflow overhead analysis and optimizations , 2011, WORKS '11.

[18]  Ann L. Chervenak,et al.  Characterizing and profiling scientific workflows , 2013, Future Gener. Comput. Syst..

[19]  Daniel S. Katz,et al.  Pegasus: A framework for mapping complex scientific workflows onto distributed systems , 2005, Sci. Program..