LargeRDFBench: A billion triples benchmark for SPARQL endpoint federation

Gathering information from the distributed Web of Data is commonly carried out using SPARQL query federation approaches. However, the fitness of these approaches for real applications is difficult to evaluate with current benchmarks, as they are either synthetic, too small in size and complexity, or do not provide means for a fine-grained evaluation. We propose LargeRDFBench, a billion-triple benchmark for SPARQL query federation which encompasses real data as well as real queries pertaining to real biomedical use cases. We evaluate state-of-the-art SPARQL endpoint federation approaches on this benchmark with respect to their query runtime, triple pattern-wise source selection, number of endpoint requests, and result completeness and correctness. Our evaluation results suggest that the performance of current SPARQL query federation systems on simple queries (simple in terms of total triple patterns, query result set sizes, execution time, use of SPARQL features, etc.) does not reflect the systems' performance on more complex queries. Moreover, current federation systems seem unable to deal with real queries that involve processing large intermediate result sets or lead to large result sets.
