Data Provenance: A Categorization of Existing Approaches

In many application areas, such as e-science and data warehousing, detailed information about the origin of data is required. This kind of information is often referred to as data provenance or data lineage. The provenance of a data item includes information about the processes and source data items that lead to its creation and current representation. The diversity of data representation models and application domains has led to a number of more or less formal definitions of provenance. Most of them are limited to a specific application domain, data representation model, or data processing facility. Not surprisingly, the associated implementations are also restricted to a particular application domain and depend on a specific data model. In this paper we survey data provenance models and prototypes, present a general categorization scheme for provenance models, and use this scheme to study the properties of the existing approaches. The categorization enables us to distinguish between different kinds of provenance information and could lead to a better understanding of provenance in general. Besides categorizing provenance types, we also consider the storage, transformation, and query requirements for the different kinds of provenance information and application domains. The analysis of existing approaches helps to reveal open research problems in the area of data provenance.
