Combining Graph Exploration and Fragmentation for Scalable RDF Query Processing

The flexibility offered by the Resource Description Framework (RDF) has made it a very popular standard for representing data with an undefined or variable schema using the concept of triples. Its success has resulted in many large-scale multidisciplinary datasets that have prompted the development of efficient RDF processing systems. Current approaches fall into two groups: the first adopts the relational model and stores the triples in tables, and the second creates data structures that model RDF data as a graph. The strategies of the first group scale more easily since they apply optimization strategies from the relational model, such as indexing and fragmentation. However, these approaches incur significant overhead when dealing with the complex queries (e.g., compound SPARQL graph patterns involving filters) that are common in existing applications. Graph-based systems, on the other hand, use more complex data structures, fail to manage main memory efficiently, and do not scale on hardware with limited resources. In this paper, we propose a novel approach to evaluating queries (Basic Graph Patterns, Wildcards, Aggregations and Sorting) on RDF data that combines RDF graph exploration with physical fragmentation of triples. We describe our graph-based storage and query evaluation models. Then, we detail the architecture of our system and explain in depth the strategy, based on the Volcano execution model, used to manage main memory at query runtime. We conducted extensive experiments on synthetic and real datasets to evaluate the efficiency of our proposal. We compared our performance with a relational-based (Virtuoso), a graph-based (gStore) and an intensive-indexing (RDF-3X) approach. According to our evaluation, our system offers the best compromise between efficient query processing and scalability.
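To make the combination of fragmentation and Volcano-style evaluation more concrete, the following is a minimal illustrative sketch and not the system described in the paper. It assumes a hypothetical fragmentation that groups triples by predicate, and hypothetical operator names (`Scan`, `Extend`) that follow the open/next/close iterator protocol of the Volcano model, so that only one partial binding per operator is held in memory at a time. It also assumes every triple pattern has a constant predicate.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]


def fragment_by_predicate(triples: List[Triple]) -> Dict[str, List[Triple]]:
    """Hypothetical fragmentation: one fragment per predicate."""
    fragments: Dict[str, List[Triple]] = {}
    for s, p, o in triples:
        fragments.setdefault(p, []).append((s, p, o))
    return fragments


def match(pattern: Triple, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Return a copy of `binding` extended so `pattern` matches `triple`, else None."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):  # variable: bind it or check the existing value
            if result.setdefault(term, value) != value:
                return None
        elif term != value:  # constant: must be equal
            return None
    return result


class Scan:
    """Leaf operator: scans the fragment of the pattern's (constant) predicate."""

    def __init__(self, fragments: Dict[str, List[Triple]], pattern: Triple):
        self.fragment = fragments.get(pattern[1], [])
        self.pattern = pattern
        self.pos = 0

    def open(self) -> None:
        self.pos = 0

    def next(self) -> Optional[Binding]:
        while self.pos < len(self.fragment):
            triple = self.fragment[self.pos]
            self.pos += 1
            found = match(self.pattern, triple, {})
            if found is not None:
                return found
        return None  # fragment exhausted

    def close(self) -> None:
        self.pos = 0


class Extend:
    """Pulls one binding at a time from its child and extends it with a pattern."""

    def __init__(self, child, fragments: Dict[str, List[Triple]], pattern: Triple):
        self.child = child
        self.fragment = fragments.get(pattern[1], [])
        self.pattern = pattern
        self.current: Optional[Binding] = None
        self.pos = 0

    def open(self) -> None:
        self.child.open()
        self.current = None
        self.pos = 0

    def next(self) -> Optional[Binding]:
        while True:
            if self.current is None:
                self.current = self.child.next()
                self.pos = 0
                if self.current is None:
                    return None  # child exhausted
            while self.pos < len(self.fragment):
                triple = self.fragment[self.pos]
                self.pos += 1
                found = match(self.pattern, triple, self.current)
                if found is not None:
                    return found
            self.current = None  # move on to the next child binding

    def close(self) -> None:
        self.child.close()


def run(plan) -> List[Binding]:
    """Drive the plan from the root, pulling results one at a time."""
    plan.open()
    results = []
    while (row := plan.next()) is not None:
        results.append(row)
    plan.close()
    return results
```

As a usage example under the same assumptions, the basic graph pattern `?x knows ?y . ?y worksAt ?c` could be evaluated with `run(Extend(Scan(fragments, ("?x", "knows", "?y")), fragments, ("?y", "worksAt", "?c")))`, where `fragments` comes from `fragment_by_predicate`. Because each operator only touches the fragment of its own predicate and returns one binding per call, intermediate results are never materialized in full.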
