Relating the new language models of information retrieval to the traditional retrieval models

During the last two years, several research groups have introduced exciting new approaches to information retrieval that are based on statistical language models. This paper relates the retrieval algorithms suggested by these approaches to widely accepted retrieval algorithms developed within three traditional models of information retrieval: the Boolean model, the vector space model and the probabilistic model. The paper shows that efficient retrieval algorithms exist that use only the matching terms in their computation. Under these conditions, the language models of information retrieval are surprisingly similar to both tf.idf term weighting as developed for the vector space model and relevance weighting as developed in the traditional probabilistic model. The paper suggests a new method for relevance weighting and a new method for ranking documents given Boolean queries. Experimental results on the TREC collection indicate that the language modelling approach outperforms the three traditional approaches.
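As an illustration of the claim that only matching terms are needed, the following is a minimal sketch and not necessarily the paper's exact formulation. It assumes a unigram language model that linearly interpolates document and collection term probabilities. The function name `lm_scores`, the smoothing weight `lam=0.15`, the use of collection term frequencies for the background model, and the toy documents are all assumptions made for this example. The sketch scores each document twice: once over all query terms, and once over matching terms only. The two scores differ by a constant that does not depend on the document, so both produce the same ranking.

```python
import math
from collections import Counter


def lm_scores(query, docs, lam=0.15):
    """Return (full, matching_only) log-scores, one pair of lists over docs.

    Assumed model: P(t | d) is smoothed as
        lam * tf(t, d) / len(d) + (1 - lam) * cf(t) / collection_length
    where lam is a hypothetical smoothing weight.
    """
    doc_tf = [Counter(d) for d in docs]
    doc_len = [len(d) for d in docs]
    coll_tf = Counter(t for d in docs for t in d)
    coll_len = sum(doc_len)

    full, matching = [], []
    for tf, n in zip(doc_tf, doc_len):
        s_full = 0.0
        s_match = 0.0
        for t in query:
            if coll_tf[t] == 0:
                continue  # term unseen in the collection: skipped in this sketch
            p_coll = coll_tf[t] / coll_len
            p_doc = tf[t] / n
            # Full score: every query term contributes, matching or not.
            s_full += math.log(lam * p_doc + (1 - lam) * p_coll)
            # Matching-only score: non-matching terms would contribute log(1) = 0.
            if tf[t] > 0:
                s_match += math.log(1 + lam * p_doc / ((1 - lam) * p_coll))
        full.append(s_full)
        matching.append(s_match)
    return full, matching


docs = [
    ["language", "model", "retrieval"],
    ["vector", "space", "model", "model"],
    ["boolean", "retrieval", "query"],
]
query = ["language", "model"]
full, matching = lm_scores(query, docs)
```

In the matching-only form, each term weight grows with the term's frequency in the document and shrinks with its frequency in the collection, which is the sense in which the weight resembles tf.idf.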
