Automatic term mismatch diagnosis for selective query expansion

People are seldom aware that their search queries frequently fail to match a majority of the relevant documents. This may not be a serious problem for topics with a large and diverse set of relevant documents, but it substantially increases the chance of search failure for less popular search needs. We aim to address the mismatch problem by developing accurate and simple queries that require minimal effort to construct. This is achieved by targeting retrieval interventions at the query terms that are likely to mismatch relevant documents. For a given topic, the proportion of relevant documents that do not contain a term measures the probability that the term mismatches relevant documents, or the term mismatch probability. Recent research demonstrates that this probability can be estimated reliably prior to retrieval. Typically, it is used in probabilistic retrieval models to provide query-dependent term weights. This paper develops a new use: automatic diagnosis of term mismatch. A search engine can use the diagnosis to suggest manual query reformulation, guide interactive query expansion, guide automatic query expansion, or motivate other responses. The research described here uses the diagnosis to guide interactive query expansion and to create Boolean conjunctive normal form (CNF) structured queries that selectively expand 'problem' query terms while leaving the rest of the query untouched. Experiments with TREC Ad-hoc and Legal Track datasets demonstrate that, with high-quality manual expansion, this diagnostic approach can reduce user effort by 33% and produce simple and effective structured queries that surpass their bag-of-words counterparts.
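The definition of term mismatch probability and the idea of selective CNF expansion can be illustrated with a minimal Python sketch. It is not the paper's implementation: it computes the mismatch probability directly from known relevant documents, whereas the paper predicts it before retrieval. The function names, the 0.5 threshold, the toy documents, and the expansion terms are all hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Sequence, Set


def mismatch_probability(term: str, relevant_docs: Sequence[Set[str]]) -> float:
    """Proportion of relevant documents that do NOT contain `term`."""
    if not relevant_docs:
        raise ValueError("need at least one relevant document")
    missing = sum(1 for doc in relevant_docs if term not in doc)
    return missing / len(relevant_docs)


def build_cnf_query(
    query_terms: Sequence[str],
    mismatch: Dict[str, float],
    expansions: Dict[str, List[str]],
    threshold: float = 0.5,  # hypothetical cut-off for a "problem" term
) -> str:
    """Expand only the terms whose mismatch exceeds `threshold`.

    Each query term becomes one conjunct; a problem term's conjunct is
    an OR over the term and its expansion terms, while the other terms
    are left untouched.
    """
    clauses = []
    for term in query_terms:
        alternatives = [term]
        if mismatch[term] > threshold:
            alternatives += expansions.get(term, [])
        clauses.append("(" + " OR ".join(alternatives) + ")")
    return " AND ".join(clauses)


# Toy relevant set (assumed data): each document is a set of tokens.
relevant = [
    {"car", "prices"},
    {"automobile", "prices"},
    {"auto", "prices"},
    {"vehicle", "prices"},
]
query = ["car", "prices"]
mismatch = {t: mismatch_probability(t, relevant) for t in query}
# mismatch == {"car": 0.75, "prices": 0.0}

# Hypothetical manually supplied expansion terms.
expansions = {"car": ["automobile", "auto"], "prices": ["cost"]}
cnf = build_cnf_query(query, mismatch, expansions)
# cnf == "(car OR automobile OR auto) AND (prices)"
```

In this toy case only "car" is diagnosed as a problem term, so only its conjunct is expanded. The expansion available for "prices" is never used, which is where the saving in expansion effort would come from.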
