Executing Provenance-Enabled Queries over Web Data

The proliferation of heterogeneous Linked Data on the Web poses new challenges to database systems. In particular, because of this heterogeneity, the capacity to store, track, and query provenance data is becoming a pivotal feature of modern triple stores. In this paper, we tackle the problem of efficiently executing provenance-enabled queries over RDF data. We propose, implement, and empirically evaluate five different query execution strategies for RDF queries that incorporate knowledge of provenance. The evaluation is conducted on Web Data obtained from two different Web crawls (the Billion Triple Challenge and the Web Data Commons). Our evaluation shows that an adaptive query materialization execution strategy performs best in our context. Interestingly, we find that because provenance is prevalent within Web Data and is highly selective, it can be used to improve query processing performance. This result is counterintuitive, since provenance is often associated with additional overhead.
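
The intuition that selective provenance can reduce query work can be illustrated with a minimal Python sketch. It is not one of the five strategies evaluated in the paper; it only contrasts filtering by source after matching with restricting the data to the allowed sources before matching. The quads, the source URLs, and the function names (`post_filter`, `pre_filter`, `build_source_index`) are hypothetical and chosen for illustration only.

```python
from collections import defaultdict

# Hypothetical quads: (subject, predicate, object, source).
QUADS = [
    ("alice", "knows", "bob", "http://example.org/crawl/a"),
    ("alice", "knows", "carol", "http://example.org/crawl/b"),
    ("bob", "knows", "carol", "http://example.org/crawl/b"),
    ("carol", "likes", "rdf", "http://example.org/crawl/c"),
]


def match(quad, pattern):
    """Return True if the triple part of quad matches pattern (None = wildcard)."""
    return all(p is None or p == v for p, v in zip(pattern, quad[:3]))


def post_filter(quads, pattern, allowed_sources):
    """Scan every quad, match the pattern, then discard disallowed sources."""
    scanned = 0
    results = []
    for quad in quads:
        scanned += 1
        if match(quad, pattern) and quad[3] in allowed_sources:
            results.append(quad[:3])
    return results, scanned


def build_source_index(quads):
    """Group quads by their source so a provenance filter can be applied first."""
    index = defaultdict(list)
    for quad in quads:
        index[quad[3]].append(quad)
    return index


def pre_filter(index, pattern, allowed_sources):
    """Visit only quads from allowed sources, then match the pattern."""
    scanned = 0
    results = []
    for source in allowed_sources:
        for quad in index.get(source, []):
            scanned += 1
            if match(quad, pattern):
                results.append(quad[:3])
    return results, scanned


if __name__ == "__main__":
    allowed = {"http://example.org/crawl/b"}
    pattern = (None, "knows", None)
    print(post_filter(QUADS, pattern, allowed))
    print(pre_filter(build_source_index(QUADS), pattern, allowed))
```

Both functions return the same answers, but the pre-filtering variant touches only the quads that come from the allowed source. The more selective the provenance constraint, the larger this gap becomes in the toy setting; whether that carries over to a real triple store depends on its storage layout and indexes.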
