Modeling and solving term mismatch for full-text retrieval

Even though modern retrieval systems typically use a multitude of features to rank documents, the backbone of search ranking is usually a standard tf.idf retrieval model. This thesis addresses a limitation of these fundamental retrieval models, the term mismatch problem, which occurs when query terms fail to appear in the documents that are relevant to the query. Term mismatch is a long-standing problem in information retrieval, yet it was not well understood how often mismatch happens, how important it is for retrieval, or how it affects retrieval performance. This thesis answers these questions and proposes principled solutions to address the limitation. The new understanding of the retrieval models benefits their users, and informs the development of software applications built on top of them. This direction of research is enabled by a formal definition of the probability of term mismatch and quantitative data analyses around it. In this thesis, term mismatch is defined as the probability of a term not appearing in a document that is relevant to the query. Its complement is term recall, the probability of a term appearing in relevant documents. Although term recall is known to be a fundamental quantity in the theory of probabilistic information retrieval, prior research in ad hoc retrieval provided few clues about how to estimate it reliably. This dissertation research designs two term mismatch prediction methods. Through exploratory data analyses, the research first identifies common reasons that user-specified query terms fail to appear in documents relevant to the query, develops features correlated with each reason, and integrates them into a predictive model that can be trained from data.
This prediction model uses training queries with relevance judgments to predict term mismatch for test queries without known relevance, and can be viewed as a form of transfer learning in which the training queries represent related ranking tasks that the learning algorithm uses to facilitate ranking for new test tasks. Further data analyses focus on how the mismatch probability of the same term varies across queries, and demonstrate that query-dependent features are needed for effective term mismatch prediction. At the same time, because the cross-query variation of term mismatch is small for most repeated term occurrences, a second mismatch prediction method uses historic occurrences of the same term to predict the mismatch probability of its test occurrences, providing an alternative and more efficient prediction procedure. Effective term mismatch predictions can be used in several ways to improve retrieval. Probabilistic retrieval theory suggests using the term recall probabilities as term weights in the retrieval models. Experiments on six TREC Ad hoc track and Web track datasets show that this automatic intervention substantially improves both retrieval recall and precision for long queries. Although term weighting does not substantially improve retrieval accuracy for short queries, which typically have a higher baseline performance, much larger gains are possible by solving mismatch with user-expanded Conjunctive Normal Form queries, which address the mismatch problem by expanding every query term individually. Our method uses the automatic term mismatch predictions as a diagnostic tool to guide interactive interventions, so that users can expand the query terms that need expansion most.
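One way to picture recall-based term weighting is to let each query term's predicted recall scale its contribution to the document score. The sketch below grafts such a weight onto a BM25-style scorer; the function names, parameters, and weighting scheme are illustrative assumptions, not the thesis's exact retrieval model.

```python
import math

def bm25_weighted(query_terms, doc_tf, doc_len, avg_len, df, n_docs,
                  recall_weights, k1=1.2, b=0.75):
    """BM25 scoring with a per-term weight (hypothetical sketch).

    recall_weights maps each term to its predicted term recall;
    terms likely to appear in relevant documents get emphasized.
    """
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf == 0:
            continue
        idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
        # Scale this term's contribution by its predicted recall.
        score += recall_weights.get(t, 1.0) * idf * norm
    return score

# Halving a term's predicted recall halves its score contribution.
args = (["heart"], {"heart": 3}, 100, 120, {"heart": 10}, 1000)
full = bm25_weighted(*args, recall_weights={"heart": 1.0})
half = bm25_weighted(*args, recall_weights={"heart": 0.5})
```

The same idea applies to language-modeling retrieval, where the recall probabilities become weights on per-term log-likelihoods.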
Simulated expansion interactions based on real user-expanded queries on TREC Ad hoc and Legal track datasets show that expanding the terms with the highest predicted mismatch probabilities effectively improves retrieval performance. The resulting Boolean Conjunctive Normal Form expansion queries are both compact and effective, substantially outperforming the short keyword queries as well as traditional bag-of-words expansion queries.
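The diagnostic use of the predictions can be sketched as follows: rank the query terms by predicted mismatch, expand only the worst offenders with alternatives, and emit an AND-of-ORs query. The mismatch scores and synonym lists below are hypothetical stand-ins for the predictor's output and the user's expansions.

```python
# Hypothetical sketch: build a Boolean CNF expansion query, expanding
# only the terms with the highest predicted mismatch probability.

def build_cnf_query(terms, predicted_mismatch, synonyms, expand_top_k=2):
    """Return an AND of OR-clauses, one clause per original query term."""
    # Diagnose: pick the terms most likely to miss relevant documents.
    to_expand = set(sorted(terms,
                           key=lambda t: predicted_mismatch.get(t, 0.0),
                           reverse=True)[:expand_top_k])
    clauses = []
    for t in terms:
        alts = [t] + (synonyms.get(t, []) if t in to_expand else [])
        clauses.append("(" + " OR ".join(alts) + ")")
    return " AND ".join(clauses)

query = build_cnf_query(
    ["cathode", "ray", "emissions"],
    {"cathode": 0.1, "ray": 0.2, "emissions": 0.7},
    {"emissions": ["radiation", "output"], "ray": ["beam"]},
    expand_top_k=1,
)
print(query)  # (cathode) AND (ray) AND (emissions OR radiation OR output)
```

Restricting expansion to the highest-mismatch terms is what keeps the resulting CNF queries compact while still recovering the relevant documents the original terms would have missed.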