smart-KG: Partition-Based Linked Data Fragments for Querying Knowledge Graphs

RDF and SPARQL provide a uniform way to publish and query billions of triples in open knowledge graphs (KGs) on the Web. Yet provisioning a fast, reliable, and responsive live querying solution for open KGs is still hardly possible through SPARQL endpoints alone: while such endpoints provide remarkable performance for single queries, they typically cannot cope with highly concurrent query workloads from multiple clients. To mitigate this, the Linked Data Fragments (LDF) framework sparked the design of alternative low-cost interfaces, such as Triple Pattern Fragments (TPF), that partially offload the query processing workload to the client side. On the downside, such interfaces come at the expense of higher network load due to the necessary transfer of intermediate results to the client, which also degrades query performance compared with endpoints. To address this problem, in this work we investigate alternative interfaces able to ship partitions of KGs from the server to the client, which aim at reducing server-resource consumption. To this end, we first align the formal definitions and notations of the original LDF framework to uniformly present partition-based LDF approaches. Instead of retrieving the exact triples matching a particular query pattern, these novel LDF interfaces retrieve a subset of materialized, compressed graph partitions to be further evaluated on the client side. Then, we present smart-KG, a concrete partition-based LDF approach. Our proposed approach is a step towards a better-balanced share of query processing load between clients and servers: it ships graph partitions driven by the structure of RDF graphs, grouping entities described with the same sets of properties and classes, which results in a significant reduction in data transfer.
Our experiments demonstrate that smart-KG significantly outperforms existing Web SPARQL interfaces, both on pre-existing benchmarks for highly concurrent query execution and on a novel query workload benchmark that we introduce, inspired by query logs of existing SPARQL endpoints.
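
The grouping idea described above, in which entities that share the same set of properties end up in the same partition, can be illustrated with a minimal Python sketch. It is an illustration under simplifying assumptions and not the smart-KG implementation: triples are plain `(subject, predicate, object)` tuples, classes are ignored, partitions are neither materialized nor compressed, and the function names `partition_by_predicate_set` and `partitions_for_star` are hypothetical.

```python
from collections import defaultdict


def partition_by_predicate_set(triples):
    """Group triples by the set of predicates that their subject uses.

    Subjects described with exactly the same predicates land in the
    same partition, keyed by that (frozen) predicate set.
    """
    # First pass: collect the predicate set of every subject.
    predicates_of = defaultdict(set)
    for s, p, _ in triples:
        predicates_of[s].add(p)

    # Second pass: route each triple to the partition of its subject.
    partitions = defaultdict(list)
    for s, p, o in triples:
        partitions[frozenset(predicates_of[s])].append((s, p, o))
    return dict(partitions)


def partitions_for_star(partitions, star_predicates):
    """Return the keys of partitions that may contain matches for a
    star-shaped pattern, i.e. those covering all requested predicates."""
    wanted = set(star_predicates)
    return [key for key in partitions if wanted <= key]


if __name__ == "__main__":
    # Toy data, assumed for illustration only.
    triples = [
        ("a", "name", "A"), ("a", "age", "1"),
        ("b", "name", "B"), ("b", "age", "2"),
        ("c", "name", "C"),
    ]
    parts = partition_by_predicate_set(triples)
    for key in partitions_for_star(parts, ["age"]):
        print(sorted(key), len(parts[key]))
```

In this sketch, a client asking for a star pattern over `age` would only need the partition whose key contains `age`, rather than the whole graph; the remaining evaluation would then happen on the client side.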
