Large-Scale Provenance for Soufflé

Logic programming languages such as Datalog have risen in popularity in recent years and are now widely used to answer questions about real-world problems. This popularity stems from the wide applicability of logic as a domain-specific language for a variety of problems, including static program analysis, declarative networking, and security analysis. The use of logic reduces the complexity of implementing applications, because programs are written in a declarative fashion: instead of describing computational steps imperatively, computations are concisely specified by their intended result. As a consequence, however, the absence of explicit computational steps makes debugging very challenging, and there is no clear debugging strategy in logic programming. In the field of database research, provenance has been introduced as a way of explaining the output of relations, and it can also be applied to Datalog. The main challenge with large real-world problems, however, is that the relations may contain billions of tuples, rendering existing provenance approaches infeasible. Hence, there remains a need for scalable and reusable provenance systems for high-performance open-source Datalog engines such as Soufflé. In this thesis, we develop the theory of provenance in the form of a proof tree for a given tuple. Constructing the proof tree in a naïve fashion, however, is too expensive for large datasets. To ensure scalability, we push computation from logic evaluation time to proof construction time, achieving an efficient space/time trade-off. This lazy evaluation approach produces an annotated intensional database, which can be queried after evaluation by an arbitrary number of provenance queries without recomputing the intensional database. We conduct experiments with complex industrial-strength benchmarks, including DOOP with DaCapo, which produce hundreds of millions of output tuples. We demonstrate that our novel provenance approach incurs a runtime and memory consumption overhead of 1.5× on average. Thus, it can cope with large datasets, where existing techniques and a naïve implementation become infeasible.
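
The idea of deferring work from evaluation time to proof construction time can be illustrated with a minimal sketch. It is not the thesis implementation: the two-rule `path` program, the rule labels `r1` and `r2`, the choice of a (rule, height) pair as the annotation, and the function names `evaluate` and `explain` are all assumptions made for illustration. Evaluation stores only a small annotation per derived tuple; a proof tree is rebuilt on demand by searching for body tuples of smaller height.

```python
# Illustrative input facts (EDB): edges of a small graph.
EDGES = {("a", "b"), ("b", "c"), ("c", "d")}


def evaluate(edges):
    """Bottom-up evaluation of the hypothetical program
         r1: path(x, y) :- edge(x, y).
         r2: path(x, z) :- edge(x, y), path(y, z).
    Returns the annotated IDB: each path tuple maps to the
    (rule, height) of its first derivation. No proof trees are stored."""
    ann = {(x, y): ("r1", 1) for (x, y) in edges}
    while True:
        new = {}
        for (x, y) in edges:
            for (y2, z), (_, h) in ann.items():
                if y2 == y and (x, z) not in ann:
                    # The proof is one level taller than the body tuple's proof.
                    new.setdefault((x, z), ("r2", h + 1))
        if not new:
            return ann
        ann.update(new)


def explain(t, edges, ann):
    """Rebuild one proof tree for path tuple t on demand from the annotations."""
    rule, h = ann[t]
    x, z = t
    if rule == "r1":
        return ("path", t, rule, [("edge", t)])
    # r2: look for a witness y with edge(x, y) and path(y, z) of smaller height.
    for (x2, y) in sorted(edges):
        if x2 == x and (y, z) in ann and ann[(y, z)][1] < h:
            return ("path", t, rule,
                    [("edge", (x, y)), explain((y, z), edges, ann)])
    raise ValueError(f"no derivation found for {t}")
```

Calling `evaluate(EDGES)` once yields the annotated relation, and `explain` can then be invoked for any number of tuples without re-running the evaluation. Requiring a strictly smaller height on the body tuple is what guarantees that the recursive search terminates in this sketch.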
