TripleProv: efficient processing of lineage queries in a native RDF store

Given the heterogeneity of the data one can find on the Linked Data cloud, being able to trace back the provenance of query results is rapidly becoming a must-have feature of RDF systems. While provenance models have been extensively discussed in recent years, little attention has been given to the efficient implementation of provenance-enabled queries inside data stores. This paper introduces TripleProv: a new system extending a native RDF store to efficiently handle such queries. TripleProv implements two different storage models to physically co-locate lineage and instance data, and for each of them implements algorithms for tracing provenance at two granularity levels. In the following, we present the overall architecture of our system, its different lineage storage models, and the various query execution strategies we have implemented to efficiently answer provenance-enabled queries. In addition, we present the results of a comprehensive empirical evaluation of our system over two different datasets and workloads.
