A Task Pipelining Framework for e-Science Workflow Management Systems

A workflow manager is a useful tool that brings the power of computational grid resources to scientists' desktops and allows them to conveniently assemble and run their own scientific workflows. In existing workflow systems, individual tasks wait for input to become available, perform computation, and produce output. Behind the scenes, the workflow manager automates data movement from the data-generating task to the data-consuming task. This process is referred to as file staging. Generally, stage-in, processing, and stage-out are executed serially, and traditional workflow systems treat staging as a trivial step. However, as data sizes grow exponentially and more scientific workflows require multiple processing steps to obtain the desired output, we argue that data movement will account for a large portion of overall running time and that staging will become a challenging step in scientific workflow systems. In this paper, we propose a task pipelining framework for various e-Science workflow systems. Our system is a flexible and efficient tool that helps workflow systems overlap the execution of adjacent tasks by pipelining the intermediate data transfer between interconnected tasks.
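The difference between serial staging and pipelined transfer can be illustrated with a minimal sketch. It is not the framework's implementation: the function names (`producer_task`, `consumer_step`, `run_serial`, `run_pipelined`), the chunk size, and the use of an in-memory bounded queue in place of a network transfer are all assumptions made for illustration.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the intermediate data stream


def producer_task(n_chunks):
    """Hypothetical upstream task: emits intermediate data chunk by chunk."""
    for i in range(n_chunks):
        yield list(range(i * 4, (i + 1) * 4))


def consumer_step(chunk):
    """Hypothetical downstream computation applied to a single chunk."""
    return sum(chunk)


def run_serial(n_chunks):
    """Serial staging: the consumer starts only after all output is staged."""
    staged = list(producer_task(n_chunks))  # complete stage-out / stage-in
    return [consumer_step(chunk) for chunk in staged]


def run_pipelined(n_chunks, buffer_size=2):
    """Pipelined staging: the consumer processes chunks as they arrive."""
    channel = queue.Queue(maxsize=buffer_size)  # stand-in for the transfer link
    results = []

    def produce():
        for chunk in producer_task(n_chunks):
            channel.put(chunk)  # blocks when the buffer is full
        channel.put(_DONE)

    def consume():
        while True:
            chunk = channel.get()
            if chunk is _DONE:
                break
            results.append(consumer_step(chunk))

    threads = [threading.Thread(target=produce), threading.Thread(target=consume)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Both functions produce the same results in the same order. The pipelined version only changes when the downstream task can begin, which is the overlap of adjacent tasks described above.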
