Provenance-Aware Query Optimization

Data provenance is essential for debugging query results, auditing data in cloud environments, and explaining the outputs of Big Data analytics. A well-established technique is to represent provenance as annotations on data and to instrument queries so that they propagate these annotations, producing results annotated with provenance. However, even sophisticated optimizers are often incapable of producing efficient execution plans for instrumented queries because of their inherent complexity and unusual structure. Thus, while instrumentation enables provenance support for databases without requiring any modification to the DBMS, the performance of this approach is far from optimal. In this work, we develop provenance-specific optimizations to address this problem. Specifically, we introduce algebraic equivalences targeted at instrumented queries and discuss alternative, equivalent ways of instrumenting a query for provenance capture. Furthermore, we present an extensible heuristic and cost-based optimization (CBO) framework that governs the application of these optimizations, and we implement this framework in our GProM provenance system. Our CBO is agnostic to the shape of the plan space, uses a DBMS for cost estimation, and enables retrofitting of optimization choices into existing code by adding a few lines of code. Our experiments confirm that these optimizations are highly effective, often improving performance by several orders of magnitude for diverse provenance tasks.
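
To make the idea of instrumentation concrete, the following is a minimal sketch of what an instrumented aggregation query can look like. It is an illustration only and not GProM's actual rewrite output: the `orders` table, the `prov_orders_*` column names, and the `run_example` helper are assumptions made for this example. The sketch uses SQLite through Python's standard `sqlite3` module and joins the aggregation result back to its input on the group-by attribute, so that each result row is paired with the input rows that contributed to it.

```python
import sqlite3


def run_example():
    """Run a plain aggregation and a hand-instrumented version of it.

    The schema and the prov_* naming scheme are hypothetical and chosen
    only for illustration.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (id INTEGER PRIMARY KEY, cust_id INTEGER, amount INTEGER);
        INSERT INTO orders VALUES (10, 1, 250), (11, 1, 40), (12, 2, 300);
        """
    )

    # Original query: total order amount per customer.
    original = """
        SELECT cust_id, SUM(amount) AS total
        FROM orders
        GROUP BY cust_id
        ORDER BY cust_id
    """

    # Instrumented query (assumed rewrite): the aggregation is joined back
    # to its input on the group-by attribute, and the input tuple is
    # exposed as additional provenance columns.
    instrumented = """
        SELECT agg.cust_id,
               agg.total,
               o.id     AS prov_orders_id,
               o.amount AS prov_orders_amount
        FROM (SELECT cust_id, SUM(amount) AS total
              FROM orders
              GROUP BY cust_id) AS agg
        JOIN orders AS o ON agg.cust_id = o.cust_id
        ORDER BY agg.cust_id, o.id
    """

    plain = conn.execute(original).fetchall()
    annotated = conn.execute(instrumented).fetchall()
    conn.close()
    return plain, annotated


if __name__ == "__main__":
    plain, annotated = run_example()
    print(plain)
    print(annotated)
```

Even in this small case the instrumented query scans `orders` twice and adds a join that the original query did not need. This extra structure is the kind that a general-purpose optimizer may plan poorly.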
