Holistic and Compact Selectivity Estimation for Hybrid Queries over RDF Graphs

Many RDF descriptions today are text-rich: besides structured data, they also contain a large amount of unstructured text. Text-rich RDF data is frequently queried with hybrid queries, which combine predicates over structured data with string predicates expressing textual constraints. Evaluating hybrid queries efficiently requires selectivity estimation. Previous approaches to selectivity estimation, however, suffer from inherent drawbacks that limit both their efficiency and their effectiveness. We propose a novel estimation approach, TopGuess, which uses topic models as a data synopsis. In this way, we capture correlations between structured and unstructured data in a holistic and compact manner. In a theoretical analysis, we show that TopGuess guarantees space complexity linear in the size of the text data, and that its selectivity estimation time complexity is independent of the synopsis size. In experiments on real-world data, TopGuess substantially improved estimation accuracy without sacrificing efficiency.
