Topic-based Selectivity Estimation for Hybrid Queries over RDF Graphs

The Resource Description Framework (RDF) has become an accepted standard for describing entities on the Web. Many such RDF descriptions are text-rich: besides structured data, they also contain large portions of unstructured text. As a result, RDF data is frequently queried with hybrid queries, which combine predicates over structured data with string predicates that express textual constraints. Evaluating hybrid queries requires accurate selectivity estimation. Previous work on selectivity estimation, however, suffers from inherent drawbacks that limit both its efficiency and its effectiveness. In this paper, we present a general framework for hybrid selectivity estimation. Based on its requirements, we study the applicability of existing approaches. Driven by our findings, we propose a novel estimation approach, TopGuess, which exploits topic models as data synopses. This enables us to capture correlations between structured and unstructured data in a uniform and scalable manner. We analyze TopGuess theoretically and show that it guarantees a space complexity linear in the size of the text data and a selectivity estimation time complexity independent of its synopsis size. In experiments on real-world data, TopGuess substantially improved estimation accuracy without sacrificing runtime performance.
