On Metrics for Measuring Fragmentation of Federation over SPARQL Endpoints

Processing a federated query over Linked Data is challenging because it must account for the number of sources, their locations, and the heterogeneity of the underlying systems in terms of hardware, software, data structure, and data distribution. In this work, we investigate the relationship between data distribution and communication cost in a federated SPARQL query framework. We introduce the spreading factor, a dataset metric that measures how classes and properties are distributed across a set of data sources. To observe the relationship between the spreading factor and the communication cost, we generate 9 datasets using several data fragmentation and allocation strategies. Our experimental results show that the spreading factor is correlated with the communication cost between a federated engine and the SPARQL endpoints. In terms of partitioning strategies, partitioning triples based on properties and classes can minimize the communication cost. However, such partitioning can also reduce the performance of the SPARQL endpoints within the federation framework.
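The abstract does not give the formula for the spreading factor, so the following Python sketch is only an illustrative proxy and not the paper's definition. It assumes each source is a list of (subject, predicate, object) triples, that classes are the objects of `rdf:type` triples, and it uses a hypothetical function name `spread_ratio`. The value is the average fraction of sources in which each property or class occurs: values near 1 mean terms are scattered across all sources, and lower values mean each term is concentrated in few sources.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"  # assumed marker for class membership triples


def spread_ratio(sources):
    """Illustrative proxy, not the paper's spreading factor.

    sources: dict mapping a source name to an iterable of (s, p, o) triples.
    Returns the mean fraction of sources in which each property/class occurs.
    """
    term_sources = defaultdict(set)  # term -> names of sources containing it
    for name, triples in sources.items():
        for _s, p, o in triples:
            term_sources[("property", p)].add(name)
            if p == RDF_TYPE:
                term_sources[("class", o)].add(name)
    if not sources or not term_sources:
        return 0.0
    occurrences = sum(len(names) for names in term_sources.values())
    return occurrences / (len(term_sources) * len(sources))


# Two toy allocations of the same four triples over two sources.
by_property = {
    "A": [("s1", "p1", "o1"), ("s2", "p1", "o2")],
    "B": [("s1", "p2", "o3"), ("s2", "p2", "o4")],
}
mixed = {
    "A": [("s1", "p1", "o1"), ("s1", "p2", "o3")],
    "B": [("s2", "p1", "o2"), ("s2", "p2", "o4")],
}
```

Under this proxy, the property-based allocation scores lower than the mixed one, which matches the intuition that grouping triples by property keeps each term in fewer sources.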
