Incorporating Provenance in Database Systems

The importance of maintaining provenance has been widely recognized, particularly for highly manipulated data. Currently there are two approaches: provenance generated within workflow frameworks, and provenance within a contained relational database. Workflow provenance allows workflow re-execution and can offer some explanation of results. Within relational databases, knowledge of SQL queries and relational operators is used to express what happened to a tuple. There is a disconnect between these two areas of provenance research. Techniques that work in relational databases cannot be applied to workflow systems because of heterogeneous data types and black-box operators. Meanwhile, the real-life utility of workflow systems has not been extended to database provenance. In the gap between provenance in workflow systems and databases lie a myriad of systems that need provenance. Examples include creating a new dataset, like MiMI, from several sources and processes, or building an algorithm that generates sequence alignments, like miBLAST. These hybrid systems cannot be forced into a workflow framework and do not exist solely within a database. This work solves issues that block provenance usage in hybrid systems. In particular, we look at capturing, storing, and using provenance information outside of workflow and database provenance systems. We first tackle the problem of how to capture provenance for manual tasks. Database provenance and workflow systems provide no support for tracking the provenance of user actions, yet manual work is often a large component of the effort in these hybrid systems. We describe an approach to track and record the user's actions in a queryable form. Once provenance is captured, storage can become prohibitively expensive in both hybrid and workflow systems. We exploit properties of provenance information and identify several techniques to reduce the size of the provenance store.
Additionally, usable provenance is a problem in workflow, database, and hybrid provenance systems. Provenance contains both too much and too little information. Provenance from the black boxes used in workflow and hybrid systems is impossible for a human to understand. We highlight the missing information that can assist user understanding, and develop a model of provenance answers to decrease information overload. Finally, workflow and database systems are designed to explain the results users see; they do not explain why items are absent from the result. We allow researchers to specify what they are looking for, and we answer why it does not appear in the result set.
