Storing, Indexing and Querying Large Provenance Data Sets as RDF Graphs in Apache HBase

Provenance, which records the history of an in-silico experiment, has been identified as an important requirement for scientific workflows because it supports the reproducibility of scientific discoveries, result interpretation, and problem diagnosis. Large provenance datasets are composed of many smaller provenance graphs, each of which corresponds to a single workflow execution. In this work, we address the challenge of efficient and scalable storage and querying of large collections of provenance graphs serialized as RDF graphs in an Apache HBase database. Specifically, we propose (i) novel storage and indexing techniques for RDF data in HBase that are better suited to provenance datasets than to generic RDF graphs, and (ii) novel SPARQL query evaluation algorithms that rely solely on indices to compute expensive join operations, operate on numeric values that represent triple positions rather than on the triples themselves, and eliminate the need for intermediate data transfers over a network. An empirical evaluation of our algorithms using the provenance datasets and queries of the University of Texas Provenance Benchmark confirms that our approach is efficient and scalable.
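To make the idea of index-only joins over triple positions more concrete, the following is a minimal in-memory sketch in Python. It is not the paper's HBase schema or its algorithms: the function names `build_indices` and `join_positions`, the sample graph, and the OPM-style predicates `used` and `wasGeneratedBy` are all assumptions chosen for illustration. The sketch numbers the triples of one small provenance graph, indexes each term by the positions where it occurs, and answers a two-pattern join by intersecting position sets without reading the triples themselves.

```python
from collections import defaultdict

def build_indices(triples):
    """Map every subject, predicate and object to the positions of the triples containing it."""
    s_idx, p_idx, o_idx = defaultdict(list), defaultdict(list), defaultdict(list)
    for pos, (s, p, o) in enumerate(triples):
        s_idx[s].append(pos)
        p_idx[p].append(pos)
        o_idx[o].append(pos)
    return s_idx, p_idx, o_idx

def join_positions(p1, p2, indices):
    """Evaluate { ?a p1 ?x . ?x p2 ?b } using only the indices.

    Returns pairs (i, j) of triple positions: triple i matches the first
    pattern, triple j matches the second, and the object of i equals the
    subject of j.
    """
    s_idx, p_idx, o_idx = indices
    first = set(p_idx.get(p1, ()))
    second = set(p_idx.get(p2, ()))
    pairs = []
    # Only terms that occur both as an object and as a subject can join.
    for term in o_idx.keys() & s_idx.keys():
        left = first.intersection(o_idx[term])
        right = second.intersection(s_idx[term])
        pairs.extend((i, j) for i in left for j in right)
    return sorted(pairs)

# Hypothetical provenance graph of a single workflow execution.
triples = [
    ("proc1", "used", "dataA"),            # position 0
    ("dataB", "wasGeneratedBy", "proc1"),  # position 1
    ("proc2", "used", "dataB"),            # position 2
    ("dataC", "wasGeneratedBy", "proc2"),  # position 3
]
indices = build_indices(triples)

# Which inputs did the process that generated each data item use?
result = join_positions("wasGeneratedBy", "used", indices)
# Triples are looked up only at the very end, to materialize the answer.
answers = [(triples[i][0], triples[j][2]) for i, j in result]
```

In this sketch the join works entirely on small integers, and the triples are touched only when the final answer is materialized. Because each provenance graph is small and self-contained, such a per-graph index could plausibly be stored and processed as a unit, which is one way to read the claim that no intermediate data needs to be transferred over a network.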
