Dynamic Data Exchange in Distributed RDF Stores

When RDF datasets become too large to be managed by centralised systems, they are often distributed in a cluster of shared-nothing servers, and queries are answered using a distributed join algorithm. Although such solutions have been extensively studied in relational and RDF databases, we argue that existing approaches exhibit two drawbacks. First, they usually decide statically(i.e., at query compile time) how to shuffle the data, which can lead to missed opportunities for local computation. Second, they often materialise large intermediate relations whose size is determined by the entire dataset (and not the data stored in each server), so these relations can easily exceed the memory of individual servers. As a possible remedy, we present a novel distributed join algorithm for RDF. Our approach decides when to shuffle data dynamically, which ensures that query answers that can be wholly produced within a server involve only local computation. It also uses a novel flow control mechanism to ensure that every query can be answered even if each server has a bounded amount of memory that is much smaller than the intermediate relations. We complement our algorithm with a new query planning approach that balances the cost of communication against the cost of local processing at each server. Moreover, as in several existing approaches, we distribute RDF data using graph partitioning so as to maximise local computation, but we refine the partitioning algorithm to produce more balanced partitions. We show empirically that our techniques can outperform the state of the art by orders of magnitude in terms of query evaluation times, network communication, and memory use. In particular, bounding the memory use in individual servers can mean the difference between success and failure for answering queries with large answer sets.

[1]  Haixun Wang,et al.  A Distributed Graph Engine for Web Scale RDF Data , 2013, Proc. VLDB Endow..

[2]  Jürgen Umbrich,et al.  YARS2: A Federated Repository for Querying Graph Structured Data from the Web , 2007, ISWC/ASWC.

[3]  James K. Mullin,et al.  Optimal Semijoins for Distributed Database Systems , 1990, IEEE Trans. Software Eng..

[4]  Boris Motik,et al.  Estimating the Cardinality of Conjunctive Queries over RDF Data Using Graph Summarisation , 2018, WWW.

[5]  Georg Lausen,et al.  PigSPARQL: A SPARQL Query Processing Baseline for Big Data , 2013, International Semantic Web Conference.

[6]  N. Shadbolt,et al.  4store: The Design and Implementation of a Clustered RDF Store , 2009 .

[7]  Gerhard Weikum,et al.  The RDF-3X engine for scalable management of RDF data , 2010, The VLDB Journal.

[8]  Richard E. Schantz,et al.  Clause-iteration with MapReduce to scalably query datagraphs in the SHARD graph-store , 2011, DIDC '11.

[9]  Gerhard Weikum,et al.  Scalable join processing on very large RDF graphs , 2009, SIGMOD Conference.

[10]  Georg Lausen,et al.  Sempala: Interactive SPARQL Query Processing on Hadoop , 2014, SEMWEB.

[11]  Jakub Závodný,et al.  Size Bounds for Factorised Representations of Query Results , 2015, TODS.

[12]  Bhavani M. Thuraisingham,et al.  Heuristics-Based Query Processing for Large RDF Graphs Using Cloud Computing , 2011, IEEE Transactions on Knowledge and Data Engineering.

[13]  Masatoshi Yoshikawa,et al.  Query processing for distributed databases using generalized semi-joins , 1982, SIGMOD '82.

[14]  Pablo de la Fuente,et al.  An Empirical Study of Real-World SPARQL Queries , 2011, ArXiv.

[15]  Sumit Ganguly,et al.  Query optimization for parallel execution , 1992, SIGMOD '92.

[16]  Abraham Bernstein,et al.  Hexastore: sextuple indexing for semantic web data management , 2008, Proc. VLDB Endow..

[17]  Panos Kalnis,et al.  Accelerating SPARQL queries by exploiting hash-based locality and adaptive partitioning , 2016, The VLDB Journal.

[18]  Guang Yang,et al.  Dynamic and fast processing of queries on large-scale RDF data , 2014, Knowledge and Information Systems.

[19]  Georg Lausen,et al.  S2RDF: RDF Querying with SPARQL on Spark , 2015, Proc. VLDB Endow..

[20]  Kenneth A. Ross,et al.  Track join: distributed joins with minimal network traffic , 2014, SIGMOD Conference.

[21]  Panos Kalnis,et al.  A Survey and Experimental Comparison of Distributed SPARQL Engines for Very Large RDF Data , 2017, Proc. VLDB Endow..

[22]  Goetz Graefe,et al.  Encapsulation of Parallelism and Architecture-Independence in Extensible Database Query Execution , 1993, IEEE Trans. Software Eng..

[23]  Boris Motik,et al.  Querying Distributed RDF Graphs: The Effects of Partitioning , 2014, SSWS@ISWC.

[24]  Steffen Staab,et al.  SPLENDID: SPARQL Endpoint Federation Exploiting VOID Descriptions , 2011, COLD.

[25]  François Goasdoué,et al.  CliqueSquare: Flat plans for massively parallel RDF queries , 2015, 2015 IEEE 31st International Conference on Data Engineering.

[26]  Vipin Kumar,et al.  A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs , 1998, SIAM J. Sci. Comput..

[27]  James A. Hendler,et al.  Matrix "Bit" loaded: a scalable lightweight join query processor for RDF data , 2010, WWW '10.

[28]  M. Tamer Özsu,et al.  Diversified Stress Testing of RDF Data Management Systems , 2014, SEMWEB.

[29]  Katja Hose,et al.  FedX: Optimization Techniques for Federated Query Processing on Linked Data , 2011, SEMWEB.

[30]  Peng Peng,et al.  Processing SPARQL queries over distributed RDF graphs , 2014, The VLDB Journal.

[31]  Daniel J. Abadi,et al.  Scalable SPARQL querying of large RDF graphs , 2011, Proc. VLDB Endow..

[32]  Ling Liu,et al.  Scaling Queries over Big RDF Graphs with Semantic Hash Partitioning , 2013, Proc. VLDB Endow..

[33]  Martin Theobald,et al.  TriAD: a distributed shared-nothing RDF engine based on asynchronous message passing , 2014, SIGMOD Conference.

[34]  Katja Hose,et al.  Partout: a distributed engine for efficient RDF processing , 2012, WWW.

[35]  Jeff Heflin,et al.  LUBM: A benchmark for OWL knowledge base systems , 2005, J. Web Semant..

[36]  Hai Jin,et al.  SemStore: A Semantic-Preserving Distributed RDF Triple Store , 2014, CIKM.

[37]  Ioannis Konstantinou,et al.  H2RDF+: High-performance distributed joins over large-scale RDF graphs , 2013, 2013 IEEE International Conference on Big Data.

[38]  Sherif Sakr,et al.  DREAM: Distributed RDF Engine with Adaptive Query Planner and Minimal Communication , 2015, Proc. VLDB Endow..

[39]  Yavor Nenov,et al.  Parallel Materialisation of Datalog Programs in Centralised, Main-Memory RDF Systems , 2014, AAAI.

[40]  Katja Hose,et al.  WARP: Workload-aware replication and partitioning for RDF , 2013, 2013 IEEE 29th International Conference on Data Engineering Workshops (ICDEW).

[41]  Donald Kossmann,et al.  The state of the art in distributed query processing , 2000, CSUR.

[42]  Georg Lausen,et al.  S2X: Graph-Parallel Querying of RDF with GraphX , 2015, Big-O/DMAH@VLDB.

[43]  Min Wang,et al.  EAGRE: Towards scalable I/O efficient SPARQL query evaluation on the cloud , 2013, 2013 IEEE 29th International Conference on Data Engineering (ICDE).

[44]  Martin L. Kersten,et al.  Column-store support for RDF data management: not all swans are white , 2008, Proc. VLDB Endow..