A parallel derivation of probabilistic information retrieval models

This paper investigates, in a stringent mathematical formalism, the parallel derivation of three grand probabilistic retrieval models: binary independent retrieval (BIR), the Poisson model (PM), and language modelling (LM). The investigation has been motivated by a number of questions. Firstly, though sharing the same origin, namely the probability of relevance, the models differ with respect to event spaces. How can this be captured in a consistent notation, and can we relate the event spaces? Secondly, BIR and PM are closely related, but how does LM fit in? Thirdly, how are tf-idf and probabilistic models related? The parallel investigation of the models leads to a number of formalised results: BIR and PM assume the collection to be a set of non-relevant documents, whereas LM assumes the collection to be a set of terms from relevant documents. PM can be viewed as a bridge connecting BIR and LM. A BIR-LM equivalence explains BIR as a special LM case. PM explains tf-idf, and both BIR and LM probabilities express tf-idf in a dual way.
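
To make the three models concrete, the following is a minimal sketch of the quantities each one works with, using common textbook forms rather than the paper's own derivation. The toy collection, the function names, the relevant-side probability `p_rel = 0.5`, and the mixture parameter `delta = 0.8` are all illustrative assumptions. The BIR weight uses the assumption stated in the abstract that the collection stands in for the non-relevant documents.

```python
import math

# Toy collection (hypothetical data, for illustration only).
docs = {
    "d1": "sailing boats sailing sea".split(),
    "d2": "sea sea waves".split(),
    "d3": "boats harbour".split(),
    "d4": "mountain hiking".split(),
}
N = len(docs)  # number of documents


def df(term):
    """Document frequency: number of documents containing the term."""
    return sum(1 for words in docs.values() if term in words)


def idf(term):
    """Plain idf, log(N / df). Assumes the term occurs in the collection."""
    return math.log(N / df(term))


def bir_weight(term, p_rel=0.5):
    """BIR term weight log[p(1-q) / (q(1-p))].

    q is estimated from the whole collection, i.e. the collection is
    treated as a set of non-relevant documents. p_rel is an assumed
    constant. Undefined if the term occurs in no or in all documents.
    """
    q = df(term) / N
    return math.log(p_rel * (1 - q) / (q * (1 - p_rel)))


def poisson_prob(k, lam):
    """Poisson probability of observing k occurrences given mean lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


def lm_prob(term, doc_id, delta=0.8):
    """LM mixture: delta * P(t|d) + (1 - delta) * P(t|c).

    Both probabilities are estimated over term occurrences (locations),
    not over documents.
    """
    words = docs[doc_id]
    p_doc = words.count(term) / len(words)
    total = sum(len(w) for w in docs.values())
    p_coll = sum(w.count(term) for w in docs.values()) / total
    return delta * p_doc + (1 - delta) * p_coll


for t in ("sailing", "sea", "boats"):
    print(t, round(idf(t), 3), round(bir_weight(t), 3),
          round(lm_prob(t, "d1"), 3))
```

In this sketch, `idf` and `bir_weight` count documents, while `lm_prob` counts term occurrences. This difference in event spaces is the kind the abstract refers to.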
