Resource Planning for SPARQL Query Execution on Data Sharing Platforms

To increase performance, data sharing platforms often rely on clusters of nodes on which certain tasks can be executed in parallel. Resource planning in such a setup is complex, especially deciding how many processors to use for parallel processing: because of communication overhead, adding processors does not always improve runtime. Instead, there is usually an optimum number of processors, and using either more or fewer leads to longer runtimes. In this paper, we present a cost model based on widely used statistics (VoID) and show how to compute the optimum number of processors for evaluating a particular SPARQL query over a particular configuration and RDF dataset. Our first experiments show the general applicability of our approach, but also how shortcomings in the statistics used limit the potential for optimization.
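The trade-off between parallel speedup and communication overhead can be illustrated with a minimal sketch. This is not the cost model presented in the paper: it assumes a hypothetical toy runtime of the form work / p + overhead_per_proc * p, where `work`, `overhead_per_proc`, and both function names are illustrative assumptions rather than quantities derived from VoID statistics.

```python
def estimated_runtime(work, overhead_per_proc, p):
    """Toy runtime estimate (an assumption, not the paper's model):
    the parallel share shrinks with p, the communication cost grows with p."""
    return work / p + overhead_per_proc * p


def best_processor_count(work, overhead_per_proc, max_procs):
    """Return the p in 1..max_procs with the smallest estimated runtime."""
    return min(
        range(1, max_procs + 1),
        key=lambda p: estimated_runtime(work, overhead_per_proc, p),
    )


if __name__ == "__main__":
    # Hypothetical inputs; a real planner would estimate them from dataset statistics.
    p_star = best_processor_count(work=1000.0, overhead_per_proc=2.5, max_procs=64)
    print(p_star, estimated_runtime(1000.0, 2.5, p_star))
```

Under this toy model the estimated runtime first falls and then rises as p grows, so both fewer and more processors than the minimum are slower. When the cluster offers fewer processors than the unconstrained optimum, the search simply returns the largest available count.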
