Efficient Prediction of Difficult Keyword Queries over Databases

Keyword queries on databases provide easy access to data, but often suffer from low ranking quality, i.e., low precision and/or recall, as shown in recent benchmarks. Identifying queries that are likely to have low ranking quality would help improve user satisfaction. For instance, the system may suggest alternative queries to the user in place of such hard queries. In this paper, we analyze the characteristics of hard queries and propose a novel framework to measure the degree of difficulty of a keyword query over a database, considering both the structure and the content of the database and the query results. We evaluate our query difficulty prediction model against two effectiveness benchmarks for popular keyword search ranking methods. Our empirical results show that our model predicts hard queries with high accuracy. Further, we present a suite of optimizations to minimize the incurred time overhead.
