Modeling and Predicting Term Mismatch for Full-Text Retrieval

The probability that a term appears in a relevant document is a fundamental quantity in the theory of probabilistic information retrieval, but prior research provided few clues about how to estimate it reliably. Because this probability measures how necessary it is for a term to appear in a document in order for the document to be relevant, this thesis calls it term necessity. Equivalently, it is the proportion of relevant documents that contain the term, so it also measures term recall, the complement of term mismatch. This thesis uses exploratory data analysis to identify common reasons that user-specified query terms fail to match relevant documents, develops features correlated with each reason, and integrates them into a model that can be trained from data. The resulting term necessity predictions can be used as term weights in state-of-the-art retrieval models to improve retrieval accuracy substantially. Feature-based necessity prediction also supports diagnosis and improvement of query components, and the thesis develops several forms of diagnosis and intervention. The simplest form is interactive feedback, in which potential problems with query components are identified for a person to fix. More nuanced approaches formulate structured queries automatically, using interventions that address different causes of term mismatch: removing unnecessary terms, expanding the terms that are likely to mismatch, and weighting term disjunctions after query expansion. Improved weighting of structured query components also provides a new approach to addressing the field-length biases that persist in state-of-the-art retrieval models for structured documents. Collectively, these interventions leverage term necessity predictions to address a variety of common problems in formulating effective queries.
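As an illustration of the definition above, the following minimal sketch computes term recall as the proportion of relevant documents that contain a term, and term mismatch as its complement. The function names and the toy documents are hypothetical and chosen only for illustration; this shows the quantity being estimated when relevance judgments are available, not the feature-based prediction model developed in the thesis.

```python
from typing import Iterable, Set


def term_recall(term: str, relevant_docs: Iterable[Set[str]]) -> float:
    """Proportion of relevant documents that contain `term`.

    Each document is represented as a set of its terms (an assumption
    made for this sketch; any tokenization could be used upstream).
    """
    docs = list(relevant_docs)
    if not docs:
        raise ValueError("term recall is undefined without relevant documents")
    return sum(1 for doc in docs if term in doc) / len(docs)


def term_mismatch(term: str, relevant_docs: Iterable[Set[str]]) -> float:
    """Complement of term recall: share of relevant documents missing `term`."""
    return 1.0 - term_recall(term, relevant_docs)


# Hypothetical relevant documents for one query (toy data).
relevant = [
    {"flu", "vaccine", "side", "effects"},
    {"influenza", "vaccine", "safety"},
    {"flu", "shot", "reactions"},
    {"vaccine", "adverse", "events"},
]

recall = term_recall("vaccine", relevant)      # 3 of 4 documents contain it
mismatch = term_mismatch("vaccine", relevant)  # 1 of 4 documents lacks it
print(recall, mismatch)
```

In this toy set the term "vaccine" misses the relevant document that says "shot" instead, which is the kind of vocabulary mismatch that query expansion is meant to address.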
