Provenance management in curated databases

Curated databases in bioinformatics and other disciplines are the result of a great deal of manual annotation, correction, and transfer of data from other sources. Provenance information concerning the creation, attribution, or version history of such data is crucial for assessing its integrity and scientific value. General-purpose database systems provide little support for tracking provenance, especially when data moves among databases. This paper investigates general-purpose techniques for recording provenance for data that is copied among databases. We describe an approach that tracks the user's actions while browsing source databases and copying data into a curated database, and records those actions in a convenient, queryable form. We present an implementation of this technique and use it to evaluate the feasibility of database support for provenance management. Our experiments show that although the overhead of a naive approach is fairly high, it can be decreased to an acceptable level using simple optimizations.
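To make the idea of recording copy actions in a queryable form concrete, the following is a minimal sketch and not the paper's implementation. It assumes that each user action can be logged as one record naming the operation, the target location in the curated database, and (for copies) the source location. The names `ProvRecord`, `ProvenanceLog`, `history`, `origin`, and the path strings are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ProvRecord:
    """One logged user action (all field names are hypothetical)."""
    txn: int               # sequence number of the action
    op: str                # "copy", "insert", or "delete"
    target: str            # path in the curated database
    source: Optional[str]  # path in a source database; set for copies only


class ProvenanceLog:
    """Append-only log of user actions, queryable by target path."""

    def __init__(self) -> None:
        self._records: List[ProvRecord] = []

    def record(self, op: str, target: str,
               source: Optional[str] = None) -> ProvRecord:
        """Append one action and return the stored record."""
        rec = ProvRecord(len(self._records) + 1, op, target, source)
        self._records.append(rec)
        return rec

    def history(self, target: str) -> List[ProvRecord]:
        """All actions that touched `target`, oldest first."""
        return [r for r in self._records if r.target == target]

    def origin(self, target: str) -> Optional[str]:
        """Source of `target` if its most recent action was a copy, else None."""
        for r in reversed(self._records):
            if r.target == target:
                return r.source if r.op == "copy" else None
        return None


if __name__ == "__main__":
    log = ProvenanceLog()
    # The curator copies a value from a source database, then adds a note by hand.
    log.record("copy", "curated/protein/P1/name", "sourceA/entry/17/name")
    log.record("insert", "curated/protein/P1/note")
    print(log.origin("curated/protein/P1/name"))
    print(log.origin("curated/protein/P1/note"))
```

Storing one record per action corresponds to what the abstract calls a naive approach. The sketch does not attempt the optimizations the paper evaluates, since the abstract does not describe them.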
