Towards a Better Understanding of the Relationship between Probabilistic Models in IR

Probability of relevance (PR) models are generally assumed to implement the Probability Ranking Principle (PRP) of IR, and recent publications claim that PR models and language models are similar. However, a careful analysis reveals two gaps in the chain of reasoning behind this claim. First, the PRP considers the relevance of particular documents, whereas PR models consider the relevance of any query-document pair. Second, unlike PR models, language models consider draws of terms and documents. We bridge the first gap by showing how the probability measure of PR models can be used to define the probabilistic model of the PRP. Furthermore, we argue that, given the differences between PR models and language models, the second gap cannot be bridged at the level of the probabilistic model. We instead define a new PR model based on logistic regression whose score function is similar to that of the query likelihood model. The performance of the two models is strongly correlated, which provides a bridge for the second gap at the functional and ranking level. Understanding language models in relation to logistic regression models opens up many new research directions, which we propose as future work.
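
To illustrate what a bridge "at the functional and ranking level" can look like, the following minimal Python sketch compares a query likelihood score with the logit of a logistic regression model. It is an illustration under stated assumptions, not the model defined in the paper: the Jelinek-Mercer smoothing, the parameter `lam`, the choice of log smoothed term probabilities as features, and the single shared `weight` and `bias` are all hypothetical choices made here for the example.

```python
import math
from collections import Counter


def query_likelihood_score(query, doc, collection, lam=0.5):
    """Log query likelihood of `query` given `doc`.

    Assumption: Jelinek-Mercer smoothing with mixing weight `lam`.
    Every query term is assumed to occur at least once in `collection`.
    """
    doc_tf, col_tf = Counter(doc), Counter(collection)
    score = 0.0
    for term in query:
        p_doc = doc_tf[term] / len(doc) if doc else 0.0
        p_col = col_tf[term] / len(collection)
        score += math.log((1 - lam) * p_doc + lam * p_col)
    return score


def logistic_relevance(query, doc, collection, weight=1.0, bias=0.0, lam=0.5):
    """Probability of relevance from a hypothetical logistic regression model.

    Assumption: the features are the log smoothed term probabilities, one
    per query term, all sharing one weight. The logit is then an affine
    function of the query likelihood score.
    """
    logit = weight * query_likelihood_score(query, doc, collection, lam) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def rank(score_fn, query, docs, collection, **kwargs):
    """Return document ids sorted by descending score."""
    return sorted(
        docs,
        key=lambda d: score_fn(query, docs[d], collection, **kwargs),
        reverse=True,
    )


# Toy collection (hypothetical data, for illustration only).
docs = {
    "d1": "language models rank documents".split(),
    "d2": "logistic regression predicts relevance".split(),
    "d3": "language models and logistic regression".split(),
}
collection = [term for doc in docs.values() for term in doc]
query = ["language", "models"]

ql_ranking = rank(query_likelihood_score, query, docs, collection)
lr_ranking = rank(logistic_relevance, query, docs, collection, weight=2.0, bias=-1.0)
```

Under these assumptions the logistic function is a strictly increasing transformation of the query likelihood score whenever `weight` is positive, so the two functions order the documents identically even though one is a probability of relevance and the other a term-generation likelihood. With weights estimated from relevance data, the two rankings would generally be correlated rather than identical.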
