Why and Where: A Characterization of Data Provenance

With the proliferation of database views and curated databases, the issue of data provenance (where a piece of data came from and the process by which it arrived in the database) is becoming increasingly important, especially in scientific databases, where understanding provenance is crucial to the accuracy and currency of data. In this paper we describe an approach to computing provenance when the data of interest has been created by a database query. We adopt a syntactic approach and present results for a general data model that applies to relational databases as well as to hierarchical data such as XML. A novel aspect of our work is a distinction between "why" provenance, which refers to the source data that had some influence on the existence of the data, and "where" provenance, which refers to the location(s) in the source databases from which the data was extracted.
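
The distinction between "why" and "where" provenance can be illustrated with a small sketch. The code below is not the paper's formal definition or algorithm; it is a toy example that assumes two hypothetical relations, EMP and DEPT, with made-up tuple identifiers (e1, d1, ...) and a hard-coded join query. For each output row it records the source tuples that contributed to the row's existence (why) and the source cell each output value was copied from (where).

```python
# Toy source relations. The keys "e1", "d1", ... are hypothetical tuple
# identifiers introduced only for this illustration.
EMP = {
    "e1": {"name": "Ann", "dept": "Bio"},
    "e2": {"name": "Bob", "dept": "Chem"},
}
DEPT = {
    "d1": {"dept": "Bio", "building": "West"},
    "d2": {"dept": "Chem", "building": "East"},
}


def join_with_provenance(emp, dept):
    """Join emp and dept on 'dept', returning (name, building) rows.

    Each result carries two annotations:
      why   - the set of source tuples that together justify the row
      where - for each output field, the source cell it was copied from
    """
    results = []
    for eid, e in emp.items():
        for did, d in dept.items():
            if e["dept"] == d["dept"]:
                results.append({
                    "row": {"name": e["name"], "building": d["building"]},
                    "why": {("EMP", eid), ("DEPT", did)},
                    "where": {
                        "name": ("EMP", eid, "name"),
                        "building": ("DEPT", did, "building"),
                    },
                })
    return results


RESULTS = join_with_provenance(EMP, DEPT)

if __name__ == "__main__":
    for r in RESULTS:
        print(r["row"], sorted(r["why"]), r["where"])
```

In this sketch the row for Ann exists because of both the EMP tuple e1 and the DEPT tuple d1, so both appear in its why annotation, while the value of its building field was copied from a single cell in DEPT, which is what its where annotation records.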
