Term-specific smoothing for the language modeling approach to information retrieval: the importance of a query term

This paper follows a formal approach to information retrieval based on statistical language models. Through some simple reformulations of the basic language modeling approach, we introduce the notion of the importance of a query term. The importance of a query term is an unknown parameter that explicitly models which of the query terms are generated from the relevant documents (the important terms) and which are not (the unimportant terms). The new language modeling approach is shown to explain a number of practical features of today's information retrieval systems that current information retrieval theory does not explain well, including stop words, mandatory terms, coordination level ranking, and retrieval using phrases.
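The abstract does not give the scoring formula, so the following is only a minimal sketch of how a per-term importance parameter might behave, assuming the common linear-interpolation form of smoothing in which each query term's probability mixes a document model and a collection model. The function name `score`, the `importance` mapping, and the toy documents are all hypothetical and chosen for illustration. Under this assumption, an importance of 0 makes a term behave like a stop word (it contributes the same amount to every document), and an importance of 1 makes it mandatory (documents lacking the term get zero probability).

```python
import math
from collections import Counter


def score(query_terms, importance, doc_tokens, collection_tokens):
    """Log-probability of the query given a document.

    Assumed model (not taken from the abstract): for each query term t,
    P(t) = lam_t * P(t | document) + (1 - lam_t) * P(t | collection),
    where lam_t = importance[t] is a term-specific weight in [0, 1].
    """
    doc_tf = Counter(doc_tokens)
    col_tf = Counter(collection_tokens)
    doc_len = len(doc_tokens)
    col_len = len(collection_tokens)

    log_p = 0.0
    for term in query_terms:
        lam = importance[term]
        p_doc = doc_tf[term] / doc_len if doc_len else 0.0
        p_col = col_tf[term] / col_len if col_len else 0.0
        p = lam * p_doc + (1.0 - lam) * p_col
        if p == 0.0:
            # The document cannot have generated this term at all.
            return float("-inf")
        log_p += math.log(p)
    return log_p


# Toy data, invented for illustration only.
d1 = "the language model approach".split()
d2 = "the vector space model".split()
collection = d1 + d2

# "the" is treated as unimportant (0.0), "language" as mandatory (1.0).
weights = {"the": 0.0, "language": 1.0}
s1 = score(["the", "language"], weights, d1, collection)
s2 = score(["the", "language"], weights, d2, collection)
```

Here `s1` is finite while `s2` is negative infinity, because `d2` lacks the mandatory term. Intermediate weights between 0 and 1 give the usual smoothed ranking, in which a missing term lowers a document's score without excluding the document.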
