An analytical study of large SPARQL query logs

With the adoption of RDF as the data model for Linked Data and the Semantic Web, it has become increasingly common for end-users to formulate queries against SPARQL endpoints. In this paper, we conduct an in-depth analytical study of queries formulated by end-users and harvested from large and up-to-date query logs from a wide variety of RDF data sources. As opposed to previous studies, ours is the first assessment of a voluminous query corpus, spanning several years and covering many representative SPARQL endpoints. Apart from the syntactic structure of the queries, which already exhibits interesting results on this generalized corpus, we drill deeper into the structural characteristics related to the graph and hypergraph representations of queries. We outline the most common shapes of queries when visually displayed as undirected graphs, and characterize their (hyper-)tree width. Moreover, we analyze the evolution of queries over time by introducing the novel concept of a streak, i.e., a sequence of queries that appear as subsequent modifications of a seed query. Our study offers several fresh insights into the already rich query features of real SPARQL queries formulated by real users, and leads us to draw a number of conclusions and pinpoint future directions for SPARQL query evaluation, query optimization, tuning, and benchmarking.
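The notion of a streak can be made concrete with a small sketch. The abstract does not state how "subsequent modification" is measured, so everything below is an assumption made for illustration: the textual similarity measure (Python's `difflib.SequenceMatcher`), the `threshold` value, the `window` parameter, and the function names `similar` and `find_streaks` are hypothetical and are not the definition used in the study.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float) -> bool:
    """Return True when two query strings are textually close.

    Assumption: character-level similarity stands in for "modification".
    """
    return SequenceMatcher(None, a, b).ratio() >= threshold


def find_streaks(queries, threshold=0.75, window=3):
    """Group a chronologically ordered query log into streaks.

    A query extends an open streak when it is similar to that streak's
    most recent query and that query occurred at most `window` positions
    earlier in the log. Otherwise the query seeds a new streak.
    Both `threshold` and `window` are illustrative parameters.
    """
    streaks = []  # each streak is a list of (position, query) pairs
    for pos, query in enumerate(queries):
        # Prefer the most recently started streak that still matches.
        for streak in reversed(streaks):
            last_pos, last_query = streak[-1]
            if pos - last_pos > window:
                continue  # streak is too old to be extended
            if similar(last_query, query, threshold):
                streak.append((pos, query))
                break
        else:
            # No open streak matched: this query becomes a new seed.
            streaks.append([(pos, query)])
    return [[q for _, q in streak] for streak in streaks]


if __name__ == "__main__":
    log = [
        "SELECT ?s WHERE { ?s a <http://example.org/Person> }",
        "SELECT ?s WHERE { ?s a <http://example.org/Person> } LIMIT 10",
        "ASK { ?x ?p ?o }",
        "SELECT ?s ?n WHERE { ?s a <http://example.org/Person> } LIMIT 10",
    ]
    for streak in find_streaks(log):
        print(len(streak), "queries, seed:", streak[0])
```

In this sketch the unrelated `ASK` query interrupts the sequence without ending it, because the `window` lets a later refinement still attach to the earlier seed. A stricter setting such as `window=1` would split the same log into more, shorter streaks.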
