Provenance as dependency analysis

Provenance is information recording the source, derivation or history of some information. Provenance tracking has been studied in a variety of settings, particularly database management systems. However, although many candidate definitions of provenance have been proposed, the mathematical or semantic foundations of data provenance have received comparatively little attention. In this paper, we argue that dependency analysis techniques familiar from program analysis and program slicing provide a formal foundation for forms of provenance that are intended to show how (part of) the output of a query depends on (parts of) its input. We introduce a semantic characterisation of such dependency provenance for a core database query language, show that minimal dependency provenance is not computable, and provide dynamic and static approximation techniques. We also discuss preliminary implementation experience with using dependency provenance to compute data slices, or summaries of the parts of the input relevant to a given part of the output.
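As a rough illustration of the dynamic side of this idea, the following sketch tags every input field with a label and propagates sets of labels through a simple select-and-project query. It is a simplified approximation written for this summary, not the paper's formal definition or implementation: the names `Ann`, `load`, and `select_project`, and the `people` table, are hypothetical. The dependency set of an output field can then be read as a data slice, that is, the input fields that may be relevant to it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ann:
    """A value paired with the set of input labels it may depend on."""
    value: object
    deps: frozenset


def load(table_name, rows):
    """Annotate each field with a unique label (table, row index, column)."""
    return [
        {col: Ann(val, frozenset({(table_name, i, col)}))
         for col, val in row.items()}
        for i, row in enumerate(rows)
    ]


def select_project(rows, test_col, predicate, out_cols):
    """Keep rows whose test_col satisfies predicate and output out_cols.

    Assumption of this sketch: an output field depends on the field it was
    copied from and on the tested field, because changing the tested field
    could remove the row from the result.
    """
    out = []
    for row in rows:
        tested = row[test_col]
        if predicate(tested.value):
            out.append({
                col: Ann(row[col].value, row[col].deps | tested.deps)
                for col in out_cols
            })
    return out


# Hypothetical input table, used only for illustration.
people = load("people", [
    {"name": "ann", "age": 34, "city": "Oslo"},
    {"name": "bob", "age": 17, "city": "Rome"},
])

adults = select_project(people, "age", lambda a: a >= 18, ["name"])

# The slice for the single output name: the labels of the input fields
# it was computed from.
slice_for_name = adults[0]["name"].deps
```

Here `slice_for_name` contains the labels for the `name` and `age` fields of the first row and omits `city`, which the query never reads. The sketch tracks dependencies per field only; it does not capture how the presence or absence of whole rows in the result depends on the input.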
