TF-IDF uncovered: a study of theories and probabilities

Existing interpretations of TF-IDF are based on binary independence retrieval, Poisson models, information theory, and language modelling. This paper reviews those interpretations and then systematically relates TF-IDF to the probabilities <i>P</i>(<i>q</i>|<i>d</i>) and <i>P</i>(<i>d</i>|<i>q</i>). Two approaches are explored: a space of <i>independent</i> terms and a space of <i>disjoint</i> terms. For <i>independent</i> terms, an "extreme" query/non-query term assumption uncovers TF-IDF, and an analogy between <i>P</i>(<i>d</i>|<i>q</i>) and the probabilistic odds <i>O</i>(<i>r</i>|<i>d</i>, <i>q</i>) mirrors relevance feedback. For <i>disjoint</i> terms, a relationship between probability theory and TF-IDF is established through the integral ∫ 1/<i>x</i> d<i>x</i> = log <i>x</i>. The study shows that components such as divergence from randomness and pivoted document length are inherent parts of a document-query independence (DQI) measure, and that an integral of the DQI over the term occurrence probability leads to TF-IDF.
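
The role of the integral ∫ 1/<i>x</i> d<i>x</i> = log <i>x</i> can be illustrated with a small numerical check. The sketch below is not the paper's DQI derivation. It only assumes the common definition idf = -log(df/N), where df/N is taken as the term occurrence probability, and shows that integrating 1/<i>x</i> from that probability up to 1 gives the same value. The function names `idf`, `integral_one_over_x`, and `tf_idf`, and the example counts, are hypothetical.

```python
import math


def idf(df, n_docs):
    """IDF as -log of the term occurrence probability df / n_docs (assumed definition)."""
    return -math.log(df / n_docs)


def integral_one_over_x(lower, upper=1.0, steps=100_000):
    """Midpoint-rule approximation of the integral of 1/x over [lower, upper]."""
    width = (upper - lower) / steps
    return sum(width / (lower + (i + 0.5) * width) for i in range(steps))


def tf_idf(tf, df, n_docs):
    """Plain TF-IDF weight: raw term frequency times IDF."""
    return tf * idf(df, n_docs)


# Hypothetical collection: a term occurring in 10 of 1000 documents.
p_term = 10 / 1000
print(idf(10, 1000))                # closed form: -log(0.01)
print(integral_one_over_x(p_term))  # numerical integral of 1/x from 0.01 to 1
print(tf_idf(3, 10, 1000))          # weight for a term seen 3 times in a document
```

The two printed IDF values should agree to several decimal places, which is the sense in which a logarithmic weight can be read as an integral over a probability.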
