A probabilistic justification for using tf×idf term weighting in information retrieval

Abstract. This paper presents a new probabilistic model of information retrieval. Its most important modeling assumption is that documents and queries are defined by an ordered sequence of single terms. Well-known existing models of information retrieval do not make this assumption, but it is essential in the field of statistical natural language processing. The paper uses advances already made in statistical natural language processing to formulate a probabilistic justification for tf×idf term weighting. It shows that the new probabilistic interpretation of tf×idf term weighting might lead to a better understanding of statistical ranking mechanisms, for example by explaining how they relate to coordination level ranking. A pilot experiment on the TREC collection shows that the linguistically motivated weighting algorithm outperforms the popular BM25 weighting algorithm.
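
For readers unfamiliar with the term, the following is a minimal sketch of the textbook form of tf×idf ranking, in which a document's score is the sum over query terms of term frequency times the logarithm of inverse document frequency. It is not the probabilistic model derived in the paper. The function name `tfidf_scores`, the toy documents, and the choice of `log(N / df)` as the idf component are assumptions made only for illustration.

```python
import math
from collections import Counter


def tfidf_scores(query, docs):
    """Score each tokenized document by the sum of tf * idf over the query terms.

    Textbook formulation (an assumption, not the paper's weighting):
    tf is the raw count of the term in the document, idf = log(N / df).
    """
    n_docs = len(docs)

    # Document frequency: the number of documents that contain each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if df[term] == 0:
                continue  # a term absent from the collection contributes nothing
            score += tf[term] * math.log(n_docs / df[term])
        scores.append(score)
    return scores


# Hypothetical toy collection, already split into single terms.
docs = [
    "information retrieval with term weighting".split(),
    "statistical natural language processing".split(),
    "probabilistic retrieval models".split(),
]
scores = tfidf_scores("retrieval weighting".split(), docs)
```

In this toy collection the first document matches both query terms and ranks highest, the third matches only the more common term, and the second matches neither. With these particular documents the order agrees with a coordination level ranking (ordering by the number of distinct query terms matched), but in general the two can differ.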
