Document Length Normalization

In the TREC collection (a large full-text experimental collection with widely varying document lengths), we observe that the likelihood of a document being judged relevant by a user increases with document length. We show that a retrieval strategy such as the vector-space cosine match, which retrieves documents of different lengths with roughly equal probability, will not optimally retrieve useful documents from such a collection. We present a modified technique that attempts to match the likelihood of retrieving a document of a given length to the likelihood that documents of that length are judged relevant, and show that this technique yields significant improvements in retrieval effectiveness.
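The abstract does not spell out the modified technique, so the following is only an illustrative sketch of how a length-aware correction to cosine normalization might look. It assumes a linear "pivot and slope" adjustment of the cosine factor. The function names, the `pivot` and `slope` parameters, the toy documents, and the value 0.5 are all assumptions made for illustration, not details taken from the text.

```python
import math

def cosine_norm(doc):
    """Euclidean length of a term-weight vector (the cosine normalization factor)."""
    return math.sqrt(sum(w * w for w in doc.values()))

def pivoted_norm(old_norm, pivot, slope):
    """Hypothetical correction: tilt the normalization factor around `pivot`.

    With slope < 1, documents whose factor is above the pivot (long documents)
    are divided by less than before, and those below it by more.
    """
    return (1.0 - slope) * pivot + slope * old_norm

def score(query, doc, norm):
    """Inner product of query and document weights, divided by a normalization factor."""
    return sum(q * doc.get(term, 0.0) for term, q in query.items()) / norm

# Toy data (assumed): both documents match the query equally well,
# but the long one contains 14 additional unrelated terms.
query = {"length": 1.0, "normalization": 1.0}
short_doc = {"length": 1.0, "normalization": 1.0}
long_doc = dict(short_doc, **{f"term{i}": 1.0 for i in range(14)})

norms = [cosine_norm(short_doc), cosine_norm(long_doc)]
pivot = sum(norms) / len(norms)   # assumed choice: average cosine factor
slope = 0.5                       # assumed value; would be tuned in practice

cos_short = score(query, short_doc, norms[0])
cos_long = score(query, long_doc, norms[1])
piv_short = score(query, short_doc, pivoted_norm(norms[0], pivot, slope))
piv_long = score(query, long_doc, pivoted_norm(norms[1], pivot, slope))

print("cosine :", cos_short, cos_long)
print("pivoted:", piv_short, piv_long)
```

Under these assumptions the correction raises the long document's score and lowers the short document's score relative to plain cosine normalization, which is the direction of adjustment the abstract argues for. A document whose factor equals the pivot is unaffected.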
