Karma2: Provenance Management for Data-Driven Workflows

The growing ability of the sciences to sense the world around us is driving a need for data-driven e-Science applications that are controlled by workflows composed of services on the Grid. Our work focuses on provenance collection for these workflows, which is necessary to validate a workflow and to determine the quality of the data products it generates. The challenge we address is to record uniform and usable provenance metadata that meets the needs of the domain while minimizing both the modification burden on service authors and the performance overhead on the workflow engine and the services. The framework generates discrete provenance activities during the lifecycle of a workflow execution; these activities can be aggregated into complex data and process provenance graphs that can span multiple workflows. The implementation uses a loosely coupled publish-subscribe architecture to propagate these activities, and the capabilities of the system satisfy the needs of detailed provenance collection. A performance evaluation of a prototype finds a minimal performance overhead (in the range of 1% for an eight-service workflow using 271 data products).
