Predicting the effectiveness of keyword queries on databases

Keyword query interfaces (KQIs) for databases provide easy access to data, but often suffer from low ranking quality, i.e., low precision and/or recall, as shown in recent benchmarks. Identifying queries that are likely to have low ranking quality would help improve user satisfaction. For instance, the system may suggest alternative queries to the user for such hard queries. In this paper, we analyze the characteristics of hard queries and propose a novel framework to measure the degree of difficulty of a keyword query over a database, considering both the structure and the content of the database and the query results. We evaluate our query difficulty prediction model against two relevance judgment benchmarks for keyword search on databases, INEX and SemSearch. Our study shows that our model predicts hard queries with high accuracy. Further, our prediction algorithms incur minimal time overhead.
