An analytical study of large SPARQL query logs

With the adoption of RDF as the data model for Linked Data and the Semantic Web, query specification by end users has become increasingly common at SPARQL endpoints. In this paper, we conduct an in-depth analytical study of the queries formulated by end users and harvested from large and up-to-date structured query logs from a wide variety of RDF data sources. In contrast to previous studies, ours is the first assessment of a voluminous query corpus, spanning several years and covering many representative SPARQL endpoints. Apart from the syntactic structure of the queries, which already exhibits interesting results on this generalized corpus, we drill deeper into the structural characteristics related to the graph and hypergraph representations of queries. We outline the most common shapes of queries when visually displayed as undirected graphs and characterize their treewidth, the length of their cycles, the maximal degree of their nodes, and more. For queries that cannot be adequately represented as graphs, we investigate their hypergraphs and hypertreewidth. Moreover, we analyze the evolution of queries over time by introducing the novel concept of a streak, i.e., a sequence of queries that appear as subsequent modifications of a seed query. Our study offers several fresh insights into the already rich features of real SPARQL queries formulated by real users, and leads us to draw a number of conclusions and pinpoint future directions for SPARQL query evaluation, query optimization, tuning, and benchmarking.
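To make the graph view of a query concrete, the following minimal sketch treats each triple pattern as an undirected edge between its subject and object, then reports the node count, the maximal degree, and whether the graph contains a cycle. It is only an illustration under stated assumptions, not the procedure used in the study: the function name `shape_summary`, the tuple encoding of triple patterns, and the choice to ignore predicates and merge parallel edges are all hypothetical simplifications.

```python
from collections import defaultdict


def shape_summary(triples):
    """Summarize the undirected graph of a list of (subject, predicate, object)
    triple patterns. Assumption: predicates are ignored and parallel edges
    between the same two nodes are merged into one edge."""
    # One undirected edge per distinct {subject, object} pair.
    edges = {frozenset((s, o)) for s, _p, o in triples}

    parent = {}  # union-find forest, used for cycle detection

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    degree = defaultdict(int)
    cyclic = False
    for edge in edges:
        if len(edge) == 1:
            # Subject equals object: a self-loop counts as a cycle here.
            (v,) = edge
            find(v)
            cyclic = True
            continue
        u, v = tuple(edge)
        degree[u] += 1
        degree[v] += 1
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            cyclic = True  # u and v were already connected
        else:
            parent[root_u] = root_v

    return {
        "nodes": len(parent),
        "edges": len(edges),
        "max_degree": max(degree.values(), default=0),
        "cyclic": cyclic,
    }


# Hypothetical example: a chain-shaped query with two triple patterns.
chain = [("?a", ":knows", "?b"), ("?b", ":worksAt", "?c")]
print(shape_summary(chain))
```

An acyclic result with maximal degree at most two corresponds to a chain-like query, whereas a single high-degree node with all other nodes of degree one suggests a star. Treewidth and hypertree width need dedicated decomposition algorithms and are outside the scope of this sketch.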
