Selective Provenance for Datalog Programs Using Top-K Queries

Highly expressive declarative languages, such as datalog, are now commonly used to model the operational logic of data-intensive applications. The typical complexity of such datalog programs, and the large volume of data that they process, call for explanations of their results. Results may be explained by tracking and presenting data provenance. Here we focus on a detailed form of provenance (how-provenance), defined as the set of derivation trees of a given fact. While informative, such full provenance is typically too large and complex (even when compactly represented) to display to the user. To this end, we propose a novel top-k query language for querying datalog provenance, supporting selection criteria based on tree patterns and ranking based on the rules and database facts used in the derivation. We propose a novel, efficient algorithm based on (1) instrumenting the datalog program so that, upon evaluation, it generates only relevant provenance, and (2) efficient generation of the top-k relevant provenance, combined with bottom-up datalog evaluation. The algorithm computes, in polynomial data complexity, a compact representation of the top-k trees, which may then be explicitly constructed in time linear in their size. We further study the algorithm's performance experimentally, showing that it scales even for complex datalog programs where full provenance tracking is infeasible.
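To make the idea of ranked derivation trees concrete, the following is a minimal sketch and not the algorithm of the paper. It assumes k = 1, additive positive weights on rules and database facts (lower is better), and no tree-pattern selection. The toy program, the weights, and the names `RULES`, `EDB`, `best_derivations`, and `tree` are all hypothetical. Each fact stores only a pointer to its best rule instantiation, so the table is a compact representation from which the tree is expanded in time linear in its size.

```python
# Hypothetical toy program: transitive closure with weighted rules.
# Atoms are tuples (pred, arg1, ...); strings starting uppercase are variables.
RULES = [
    ("r1", 1.0, ("path", "X", "Y"), [("edge", "X", "Y")]),
    ("r2", 1.0, ("path", "X", "Z"), [("edge", "X", "Y"), ("path", "Y", "Z")]),
]
# Hypothetical database facts with assumed weights.
EDB = {("edge", "a", "b"): 1.0, ("edge", "b", "c"): 1.0, ("edge", "a", "c"): 5.0}


def is_var(term):
    return term[0].isupper()


def match(atom, fact, binding):
    """Extend binding so that atom matches fact; return None on failure."""
    if atom[0] != fact[0] or len(atom) != len(fact):
        return None
    b = dict(binding)
    for term, value in zip(atom[1:], fact[1:]):
        if is_var(term):
            if b.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return b


def best_derivations(rules, edb):
    """Naive bottom-up evaluation keeping the cheapest derivation per fact."""
    # best[fact] = (cost, rule_name, body_facts); database facts are leaves.
    best = {fact: (w, None, ()) for fact, w in edb.items()}
    changed = True
    while changed:
        changed = False
        for name, weight, head, body in rules:
            # Join the body atoms against all currently known facts.
            partial = [({}, ())]
            for atom in body:
                partial = [(b2, used + (f,))
                           for b, used in partial
                           for f in list(best)
                           for b2 in [match(atom, f, b)] if b2 is not None]
            for b, used in partial:
                fact = (head[0],) + tuple(
                    b[t] if is_var(t) else t for t in head[1:])
                cost = weight + sum(best[f][0] for f in used)
                if fact not in best or cost < best[fact][0]:
                    best[fact] = (cost, name, used)
                    changed = True
    return best


def tree(fact, best):
    """Expand the compact table into an explicit derivation tree."""
    _, rule, body = best[fact]
    if rule is None:
        return fact
    return (fact, rule, [tree(f, best) for f in body])
```

Under these assumed weights, `path(a, c)` has two derivations: directly from `edge(a, c)` with cost 6, or through `b` with cost 4, so the table keeps the latter. Extending the sketch to k > 1 or to pattern-based selection would require keeping several candidates per fact and is not attempted here.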
