Exploring Provenance in a Distributed Job Execution System

We examine provenance in the context of a distributed job execution system. Capturing provenance information while a job executes in a distributed environment is crucial, because this information is often lost once the job has finished. In this paper we discuss what information is available within a distributed job execution system, how to capture it, and what burdens its capture places on the user and the system. We identify the data we consider essential to capture and discuss the collection of provenance in the Quill++ project of Condor. We conclude that important provenance information can be captured in a distributed job execution system with relatively little intrusion on the user or the system.
