Generalized inverse document frequency

Inverse document frequency (IDF) is one of the most useful and widely used concepts in information retrieval. There have been various attempts to provide theoretical justifications for IDF. One of the most appealing derivations follows from the Robertson-Spärck Jones relevance weight. However, this derivation, like others related to it, typically makes a number of strong assumptions that are often glossed over. In this paper, we re-examine these assumptions from a Bayesian perspective, discuss possible alternatives, and derive a new, more general form of IDF that we call generalized inverse document frequency. In addition to providing theoretical insights into IDF, we report a rigorous empirical evaluation showing that generalized IDF outperforms classical versions of IDF on a number of ad hoc retrieval tasks.
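
As background for the derivation mentioned above, the following minimal sketch shows the classical IDF and the commonly used 0.5-smoothed form of the Robertson-Spärck Jones relevance weight, which reduces to an IDF-like quantity when no relevance information is available. It illustrates only the classical formulas, not the generalized IDF proposed in the paper, whose form is not stated in the abstract. The function and parameter names are hypothetical, chosen for illustration.

```python
import math


def classical_idf(n_docs: int, doc_freq: int) -> float:
    """Classical IDF: log(N / n_t), for a term occurring in n_t of N documents."""
    return math.log(n_docs / doc_freq)


def rsj_weight(n_docs: int, doc_freq: int,
               n_rel: int = 0, rel_doc_freq: int = 0) -> float:
    """Robertson-Spärck Jones relevance weight with the usual 0.5 smoothing.

    n_docs       -- N, number of documents in the collection
    doc_freq     -- n_t, number of documents containing the term
    n_rel        -- R, number of known relevant documents (0 if none known)
    rel_doc_freq -- r_t, number of known relevant documents containing the term
    """
    num = (rel_doc_freq + 0.5) * (n_docs - doc_freq - n_rel + rel_doc_freq + 0.5)
    den = (doc_freq - rel_doc_freq + 0.5) * (n_rel - rel_doc_freq + 0.5)
    return math.log(num / den)


# With no relevance information (R = 0, r_t = 0), the weight collapses to
# log((N - n_t + 0.5) / (n_t + 0.5)), an IDF-like quantity.
n_docs, doc_freq = 1000, 10
idf_like = rsj_weight(n_docs, doc_freq)
plain_idf = classical_idf(n_docs, doc_freq)
```

One consequence of the assumptions behind this reduction is visible directly in the formula: for a term that occurs in more than half of the collection, the weight becomes negative, which the plain log(N / n_t) form never does.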
