Workload-Aware Self-tuning Histograms for the Semantic Web

Query processing systems typically rely on histograms, data structures that approximate the data distribution, to optimize query execution. Histograms can be constructed by scanning the database tables and aggregating the values of their attributes or, more efficiently, refined progressively by analysing query results. Most of the relevant literature focuses on histograms of numerical data, exploiting the natural concept of a numerical range as an estimator of the volume of data that falls within the range. This, however, leaves Semantic Web data outside the scope of the histogram literature, as its most prominent datatype, the URI, does not lend itself to defining such ranges. This article first establishes a framework that formalises histograms over arbitrary data types and provides a formalism for specifying value ranges for different datatypes. This makes explicit the properties that ranges are required to have so that histogram refinement algorithms are applicable. We demonstrate that our framework subsumes histograms over numerical data as a special case by using it to formulate the state of the art in numerical histograms. We then use the Jaro-Winkler metric to define URI ranges, exploiting the hierarchical nature of URI strings. This greatly extends the state of the art, where strings are treated as categorical data that can only be described by enumeration. We then present the open-source STRHist system, which implements these ideas. Finally, we present empirical evaluation results using STRHist over a real dataset and query workload extracted from AGRIS, the most popular and widely used bibliographic database on agricultural research and technology.
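
To make the idea of a similarity-based URI range concrete, the following minimal sketch computes the standard Jaro-Winkler similarity and uses it to test whether a URI falls within a range around a reference URI. The function `in_similarity_range`, the reference URI, and the 0.9 threshold are hypothetical choices made for illustration only; the abstract does not specify how STRHist represents ranges, and this is not its implementation.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only if they are at most `window` apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


def in_similarity_range(uri: str, reference: str,
                        threshold: float = 0.9) -> bool:
    """Hypothetical range test: is `uri` similar enough to `reference`?"""
    return jaro_winkler(uri, reference) >= threshold


# Illustrative URIs (assumed, not taken from the AGRIS dataset).
reference = "http://example.org/agris/record/1234"
print(in_similarity_range("http://example.org/agris/record/1299", reference))
print(in_similarity_range("ftp://other.net/x", reference))
```

Because URIs that share a namespace differ mostly in their trailing characters, a similarity threshold groups them together in a way that enumeration of categorical values cannot.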
