Pegasus: A framework for mapping complex scientific workflows onto distributed systems

This paper describes Pegasus, a framework for mapping complex scientific workflows onto distributed resources. Pegasus lets users represent workflows at an abstract level, without needing to worry about the particulars of the target execution systems. The paper describes general issues in mapping applications and the functionality of Pegasus. We present the results of improving application performance through workflow restructuring, which clusters multiple tasks in a workflow into single entities. A real-life astronomy application is used as the basis for the study.
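To make the idea of clustering tasks into single entities concrete, the following is a minimal sketch of one possible restructuring strategy. It groups tasks that sit at the same depth of the workflow graph, which is safe because such tasks cannot depend on one another. The function names, the `deps` dictionary format, and the `max_cluster_size` parameter are all assumptions made for illustration. This is not the Pegasus API and not necessarily the clustering scheme evaluated in the paper.

```python
from collections import defaultdict


def task_levels(deps):
    """Return the depth of every task in the workflow.

    `deps` (hypothetical format) maps each task name to the list of
    tasks it depends on. Tasks with no dependencies are at level 1.
    """
    levels = {}

    def level(task):
        if task not in levels:
            parents = deps.get(task, [])
            # A task sits one level below its deepest parent.
            levels[task] = 1 + max((level(p) for p in parents), default=0)
        return levels[task]

    for task in deps:
        level(task)
    return levels


def cluster_by_level(deps, max_cluster_size):
    """Group same-level tasks into clusters of at most `max_cluster_size`.

    Tasks at the same level never depend on each other, so each cluster
    can be submitted as one job without introducing a cycle.
    """
    levels = task_levels(deps)
    by_level = defaultdict(list)
    for task in sorted(levels):
        by_level[levels[task]].append(task)

    clusters = []
    for lvl in sorted(by_level):
        tasks = by_level[lvl]
        for i in range(0, len(tasks), max_cluster_size):
            clusters.append(tuple(tasks[i:i + max_cluster_size]))
    return clusters


# Illustrative workflow: one setup task, three independent tasks, one merge.
workflow = {
    "a": [],
    "b1": ["a"],
    "b2": ["a"],
    "b3": ["a"],
    "c": ["b1", "b2", "b3"],
}
clusters = cluster_by_level(workflow, max_cluster_size=2)
```

With this toy workflow, the three independent tasks in the middle level are packed into two clusters instead of three separate jobs. Under these assumptions, the expected benefit is that fewer jobs each pay the per-job scheduling overhead of the target system, at the cost of some lost parallelism inside each cluster.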
