Approximated Summarization of Data Provenance

Many modern applications collect large amounts of data from multiple sources and then aggregate and manipulate it in intricate ways. The complexity of such applications, combined with the size of the collected data, makes it difficult to understand how the resulting information was derived. Data provenance has proven helpful in this respect; however, maintaining and presenting the full and exact provenance may be infeasible because of its size and complexity. We therefore introduce the notion of approximated summarized provenance, which provides a compact representation of the provenance at the possible cost of information loss. Based on this notion, we present a novel provenance summarization algorithm that takes into account the semantics of the underlying data and the intended use of the provenance when producing a summary of the input provenance. Experiments measure the conciseness and accuracy of the resulting provenance summaries, and the improvement in provenance usage time.
