Recording and Reasoning over Data Provenance in Web and Grid Services

Large-scale, dynamic and open environments such as the Grid and Web Services build upon existing computing infrastructures to supply dependable and consistent large-scale computational systems. This kind of architecture has been adopted by the business and scientific communities, allowing them to exploit extensive and diverse computing resources to perform complex data processing tasks. In such systems, results are often derived by composing multiple, geographically distributed, heterogeneous services as specified by intricate workflows. This leads to the undesirable situation where the results are known, but the means by which they were achieved are not. In both scientific experiments and business transactions, lineage and dataset derivation are of paramount importance, since without them information is potentially worthless. We address the issue of data provenance, the description of the origin of a piece of data, in these environments, presenting its requirements, uses and implementation difficulties. We propose infrastructure-level support for provenance recording in service-oriented architectures such as the Grid and Web Services. We also offer services to view and retrieve provenance, and we provide a mechanism by which provenance is used to determine whether previously computed results are still up to date.
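As a rough illustration of the last point, the following minimal sketch shows one way recorded provenance could be used to decide whether a stored result is still current. It is not the infrastructure proposed here: the names `ProvenanceStore`, `ProvenanceRecord`, `record` and `is_up_to_date`, the use of SHA-256 digests, and the idea of comparing a service version string are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict


def digest(data: bytes) -> str:
    """Return a content hash used to detect changed inputs (assumed: SHA-256)."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class ProvenanceRecord:
    """Hypothetical record of how one result was derived."""
    service: str
    service_version: str
    input_digests: Dict[str, str]
    output_digest: str


@dataclass
class ProvenanceStore:
    """Hypothetical in-memory store; a real one would be a remote service."""
    records: Dict[str, ProvenanceRecord] = field(default_factory=dict)

    def record(self, result_id: str, service: str, service_version: str,
               inputs: Dict[str, bytes], output: bytes) -> None:
        # Store digests rather than the data itself to keep records small.
        self.records[result_id] = ProvenanceRecord(
            service=service,
            service_version=service_version,
            input_digests={name: digest(v) for name, v in inputs.items()},
            output_digest=digest(output),
        )

    def is_up_to_date(self, result_id: str, current_version: str,
                      current_inputs: Dict[str, bytes]) -> bool:
        rec = self.records.get(result_id)
        if rec is None:
            return False  # no provenance: the result cannot be trusted
        if rec.service_version != current_version:
            return False  # the service that produced the result has changed
        current = {name: digest(v) for name, v in current_inputs.items()}
        return current == rec.input_digests  # any changed input invalidates


store = ProvenanceStore()
store.record("r1", "align", "1.0", {"seq": b"ACGT"}, b"alignment-output")
print(store.is_up_to_date("r1", "1.0", {"seq": b"ACGT"}))
print(store.is_up_to_date("r1", "1.1", {"seq": b"ACGT"}))
```

In this sketch a result counts as stale as soon as either the service version or any input digest differs from what was recorded. That policy is deliberately conservative and is only one of several a real system might choose.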
