Provenance for nested subqueries

Data provenance is essential in applications such as scientific computing, curated databases, and data warehouses. Several systems have been developed that provide provenance functionality for the relational data model. These systems support only a subset of SQL, a severe limitation in practice since most of the application domains that benefit from provenance information use complex queries. Such queries typically involve nested subqueries, aggregation and/or user defined functions. Without support for these constructs, a provenance management system is of limited use. In this paper we address this limitation by exploring the problem of provenance derivation when complex queries are involved. More precisely, we demonstrate that the widely used definition of Why-provenance fails in the presence of nested subqueries, and show how the definition can be modified to produce meaningful results for nested subqueries. We further present query rewrite rules to transform an SQL query into a query propagating provenance. The solution introduced in this paper allows us to track provenance information for a far wider subset of SQL than any of the existing approaches. We have incorporated these ideas into the Perm provenance management system engine and used it to evaluate the feasibility and performance of our approach.

[1]  Parag Agrawal,et al.  Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS (Demo) , 2007, CIDR.

[2]  Bruce Momjian,et al.  PostgreSQL: Introduction and Concepts , 2000 .

[3]  Sanjeev Khanna,et al.  Why and Where: A Characterization of Data Provenance , 2001, ICDT.

[4]  Gustavo Alonso,et al.  Perm: Processing Provenance and Data on the Same Data Model through Query Rewriting , 2009, 2009 IEEE 25th International Conference on Data Engineering.

[5]  Antonio Badia,et al.  SQL query optimization through nested relational algebra , 2007, TODS.

[6]  Klaus R. Dittrich,et al.  Data Provenance: A Categorization of Existing Approaches , 2007, BTW.

[7]  Torsten Grabs,et al.  Execution strategies for SQL subqueries , 2007, SIGMOD '07.

[8]  James Cheney,et al.  On the expressiveness of implicit provenance in query and update languages , 2007, TODS.

[9]  Val Tannen,et al.  Provenance semirings , 2007, PODS.

[10]  Floris Geerts,et al.  MONDRIAN: Annotating and Querying Databases through Colors and Blocks , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[11]  Wang Chiew Tan Provenance in Databases: Past, Current, and Future , 2007, IEEE Data Eng. Bull..

[12]  Surajit Chaudhuri,et al.  An overview of query optimization in relational systems , 1998, PODS.

[13]  Won Kim,et al.  On optimizing an SQL-like nested query , 1982, TODS.

[14]  M. Muralikrishna,et al.  Improved Unnesting Algorithms for Join Aggregate SQL Queries , 1992, VLDB.

[15]  James Cheney,et al.  On the expressiveness of implicit provenance in query and update languages , 2008, TODS.

[16]  Umeshwar Dayal,et al.  Processing queries with quantifiers a horticultural approach , 1983, PODS.

[17]  Yogesh L. Simmhan,et al.  A survey of data provenance in e-science , 2005, SGMD.

[18]  Jennifer Widom,et al.  Tracing the lineage of view data in a warehousing environment , 2000, TODS.

[19]  Hamid Pirahesh,et al.  Cost-based optimization for magic: algebra and implementation , 1996, SIGMOD '96.

[20]  Wang Chiew Tan,et al.  DBNotes: a post-it system for relational databases based on provenance , 2005, SIGMOD '05.

[21]  Michael H. Böhlen,et al.  Efficient computation of subqueries in complex OLAP , 2003, Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405).