A parallel derivation of probabilistic information retrieval models

This paper investigates, in a stringent mathematical formalism, the parallel derivation of three grand probabilistic retrieval models: binary independent retrieval (BIR), the Poisson model (PM), and language modelling (LM). The investigation has been motivated by a number of questions. Firstly, though sharing the same origin, namely the probability of relevance, the models differ with respect to event spaces. How can this be captured in a consistent notation, and can we relate the event spaces? Secondly, BIR and PM are closely related, but how does LM fit in? Thirdly, how are tf-idf and probabilistic models related?

The parallel investigation of the models leads to a number of formalised results: BIR and PM assume the collection to be a set of non-relevant documents, whereas LM assumes the collection to be a set of terms from relevant documents. PM can be viewed as a bridge connecting BIR and LM. A BIR-LM equivalence explains BIR as a special LM case. PM explains tf-idf, and both the BIR and the LM probabilities express tf-idf in a dual way.
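To make the BIR assumption concrete, the following minimal sketch uses the standard textbook form of the BIR term weight, not the paper's own derivation. It estimates the non-relevant term probability from the whole collection, which is the assumption the abstract attributes to BIR, and compares the result with a plain idf. The function names, the toy collection size `N`, and the choice `p_rel = 0.5` (no relevance information) are illustrative assumptions.

```python
import math


def idf(n_t: int, N: int) -> float:
    """Plain inverse document frequency, -log(n_t / N)."""
    return -math.log(n_t / N)


def bir_weight(p_rel: float, n_t: int, N: int) -> float:
    """Textbook BIR term weight (log odds ratio).

    Assumption: the collection stands in for the non-relevant documents,
    so P(t | non-relevant) is estimated as n_t / N.
    """
    p_nonrel = n_t / N
    return math.log(p_rel * (1 - p_nonrel) / (p_nonrel * (1 - p_rel)))


if __name__ == "__main__":
    N = 1000  # hypothetical collection size
    for n_t in (1, 10, 100):  # hypothetical document frequencies
        print(n_t, round(idf(n_t, N), 3), round(bir_weight(0.5, n_t, N), 3))
```

With `p_rel = 0.5` the weight reduces to `log((N - n_t) / n_t)`, which stays close to the idf for rare terms and drifts away from it as the term becomes more frequent.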

Thomas Roelleke | Jun Wang

[1] Eugene L. Margulis, et al. N-Poisson document modelling, 1992, SIGIR '92.

[2] C. J. van Rijsbergen, et al. The geometry of information retrieval, 2004.

[3] Stephen E. Robertson, et al. Large Test Collection Experiments on an Operational, Interactive System: Okapi at TREC, 1995, Inf. Process. Manag.

[4] ChengXiang Zhai, et al. Probabilistic Relevance Models Based on Document and Query Generation, 2003.

[5] Kenneth Ward Church, et al. Inverse Document Frequency (IDF): A Measure of Deviations from Poisson, 1995, VLC@ACL.

[6] C. J. van Rijsbergen, et al. Term Frequency Normalization via Pareto Distributions, 2002, ECIR.

[7] Amit Singhal, et al. Pivoted document length normalization, 1996, SIGIR 1996.

[8] Stephen E. Robertson, et al. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval, 1994, SIGIR '94.

[9] Gabriella Kazai, et al. A general matrix framework for modelling Information Retrieval, 2006, Inf. Process. Manag.

[10] Djoerd Hiemstra, et al. A probabilistic justification for using tf×idf term weighting in information retrieval, 2000, International Journal on Digital Libraries.

[11] Marcia J. Bates. After the Dot-Bomb: Getting Web Information Retrieval Right This Time, 2002, First Monday.

[12] C. J. van Rijsbergen, et al. Probabilistic models of information retrieval based on measuring the divergence from randomness, 2002, TOIS.

[13] Gianni Amati, et al. Probability models for information retrieval based on divergence from randomness, 2003.

[14] Djoerd Hiemstra, et al. Language Modelling and Relevance, 2003.

[15] Stephen E. Robertson, et al. Understanding inverse document frequency: on theoretical arguments for IDF, 2004, J. Documentation.

[16] Djoerd Hiemstra, et al. The Importance of Prior Probabilities for Entry Page Search, 2002, SIGIR '02.

[17] Arjen P. de Vries, et al. Relevance information: a loss of entropy but a gain for IDF?, 2005, SIGIR '05.

[18] Stephen E. Robertson, et al. On Event Spaces and Probabilistic Models in Information Retrieval, 2005, Information Retrieval.

[19] Stephen E. Robertson, et al. Relevance weighting of search terms, 1976, J. Am. Soc. Inf. Sci.

[20] Chris Buckley, et al. Pivoted Document Length Normalization, 1996, SIGIR Forum.