A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval

Language modeling approaches to information retrieval are attractive and promising because they connect the problem of retrieval with that of language model estimation, which has been studied extensively in other application areas such as speech recognition. The basic idea of these approaches is to estimate a language model for each document, and then rank documents by the likelihood of the query according to the estimated language model. A core problem in language model estimation is smoothing, which adjusts the maximum likelihood estimator so as to correct the inaccuracy due to data sparseness. In this paper, we study the problem of language model smoothing and its influence on retrieval performance. We examine the sensitivity of retrieval performance to the smoothing parameters and compare several popular smoothing methods on different test collection.

[1]  Djoerd Hiemstra,et al.  The Importance of Prior Probabilities for Entry Page Search , 2002, SIGIR '02.

[2]  W. Bruce Croft,et al.  A Language Modeling Approach to Information Retrieval , 1998, SIGIR Forum.

[3]  Slava M. Katz,et al.  Estimation of probabilities from sparse data for the language model component of a speech recognizer , 1987, IEEE Trans. Acoust. Speech Signal Process..

[4]  Norbert Fuhr,et al.  Probabilistic Models in Information Retrieval , 1992, Comput. J..

[5]  Stephen E. Robertson,et al.  Okapi at TREC-3 , 1994, TREC.

[6]  Amit Singhal,et al.  Pivoted document length normalization , 1996, SIGIR 1996.

[7]  Gerard Salton,et al.  Term-Weighting Approaches in Automatic Text Retrieval , 1988, Inf. Process. Manag..

[8]  David J. Goodman,et al.  Personal Communications , 1994, Mobile Communications.

[9]  Yiyu Yao,et al.  On modeling information retrieval with probabilistic inference , 1995, TOIS.

[10]  Gerard Salton,et al.  A vector space model for automatic indexing , 1975, CACM.

[11]  Hermann Ney,et al.  On the Estimation of 'Small' Probabilities by Leaving-One-Out , 1995, IEEE Trans. Pattern Anal. Mach. Intell..

[12]  Kui-Lam Kwok,et al.  Improving two-stage ad-hoc retrieval for short queries , 1998, SIGIR '98.

[13]  W. Bruce Croft,et al.  A general language model for information retrieval (poster abstract) , 1999, SIGIR '99.

[14]  Peter Willett,et al.  Readings in information retrieval , 1997 .

[15]  Hermann Ney,et al.  Improved backing-off for M-gram language modeling , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[16]  Gerard Salton,et al.  Improving Retrieval Performance by Relevance Feedback , 1997 .

[17]  W. Bruce Croft,et al.  Relevance-Based Language Models , 2001, SIGIR '01.

[18]  Frederick Jelinek,et al.  Interpolated estimation of Markov source parameters from sparse data , 1980 .

[19]  Hermann Ney,et al.  On structuring probabilistic dependences in stochastic language modelling , 1994, Comput. Speech Lang..

[20]  C. J. van Rijsbergen,et al.  A Non-Classical Logic for Information Retrieval , 1997, Comput. J..

[21]  David J. C. MacKay,et al.  A hierarchical Dirichlet language model , 1995, Natural Language Engineering.

[22]  I. Good THE POPULATION FREQUENCIES OF SPECIES AND THE ESTIMATION OF POPULATION PARAMETERS , 1953 .

[23]  Stanley F. Chen,et al.  An empirical study of smoothing techniques for language modeling , 1999 .

[24]  Djoerd Hiemstra,et al.  Twenty-One at TREC7: Ad-hoc and Cross-Language Track , 1998, TREC.

[25]  Stephen E. Robertson,et al.  Probabilistic models of indexing and searching , 1980, SIGIR '80.

[26]  W. Bruce Croft,et al.  A general language model for information retrieval , 1999, CIKM '99.

[27]  John D. Lafferty,et al.  Document Language Models, Query Models, and Risk Minimization for Information Retrieval , 2001, SIGIR Forum.

[28]  Richard M. Schwartz,et al.  A hidden Markov model information retrieval system , 1999, SIGIR '99.

[29]  John Lafferty,et al.  Information retrieval as statistical translation , 1999, SIGIR 1999.