On data placement strategies in distributed RDF stores

In recent years, scalable RDF stores in the cloud have been developed, in which graph data is distributed over compute and storage nodes to scale query processing effort and memory needs. One main challenge in these RDF stores is the data placement strategy, which can be formalized in terms of graph covers. These graph covers determine whether (a) different query results may be computed on several compute nodes in parallel (vertical parallelization) and (b) individual query results can be produced only from triples assigned to few, ideally one, storage nodes (horizontal containment). We analyse the impact of the three most commonly used graph cover strategies in these terms and find that balancing query workload reduces the query execution time more than reducing data transfer over the network. To this end, we present our novel benchmark and open source evaluation platform.
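
To make the notion of a graph cover concrete, the following is a minimal sketch, assuming a simple subject-hash placement. The abstract does not name the three strategies evaluated, so this placement is chosen only for illustration, and the names `hash_cover` and `nodes_touched` are hypothetical. The sketch assigns each triple to one storage node and then counts how many nodes hold the triples of a single query result, which is a rough proxy for horizontal containment.

```python
import hashlib
from collections import defaultdict


def hash_cover(triples, num_nodes):
    """Build an illustrative graph cover: one chunk of triples per storage node.

    Assumption: triples are placed by hashing their subject, so all triples
    sharing a subject end up on the same node.
    """
    cover = defaultdict(set)
    for s, p, o in triples:
        digest = hashlib.md5(s.encode("utf-8")).digest()
        node = int.from_bytes(digest[:4], "big") % num_nodes
        cover[node].add((s, p, o))
    return cover


def nodes_touched(cover, result_triples):
    """Return the storage nodes holding at least one triple of a query result."""
    needed = set(result_triples)
    return {node for node, chunk in cover.items() if chunk & needed}


triples = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:alice", "ex:name", "Alice"),
    ("ex:bob", "ex:name", "Bob"),
]
cover = hash_cover(triples, num_nodes=4)

# A result built only from triples with the same subject is fully contained
# on one storage node under this placement.
print(len(nodes_touched(cover, triples[:2])))
```

Under this assumed placement, results that join triples with different subjects may span several nodes and therefore require data transfer, while the sizes of the chunks in `cover` indicate how evenly the data, and with it potentially the workload, is spread.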
