Smoothing document language models with probabilistic term count propagation

Smoothing of document language models is critical in language modeling approaches to information retrieval. In this paper, we present a novel way of smoothing document language models based on propagating term counts probabilistically in a graph of documents. A key difference between our approach and previous approaches is that our smoothing algorithm can propagate counts iteratively and thus achieve smoothing with remotely related documents. Evaluation results on several TREC data sets show that the proposed method significantly outperforms the simple collection-based smoothing method. Compared with other smoothing methods that also exploit local corpus structures, our method is especially effective in improving precision in top-ranked documents through “filling in” missing query terms in relevant documents. This is attractive because most users of search engine applications pay attention only to the top-ranked documents.
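
The idea of iterative count propagation can be illustrated with a minimal Python sketch. The abstract does not give the update rule, so everything here is an assumption made for illustration: the function name `propagate_counts`, the row-stochastic `transition` matrix over documents, the interpolation weight `alpha`, and the update itself (each document keeps a share of its original counts and borrows the rest from its neighbors' current counts). It is not the paper's actual formulation.

```python
def propagate_counts(counts, transition, alpha=0.5, iterations=2):
    """Propagate term counts over a document graph (illustrative sketch).

    counts      -- list of dicts, counts[i][term] = raw count of term in doc i
    transition  -- transition[i][j] = assumed probability that doc i borrows
                   counts from doc j (each row sums to 1)
    alpha       -- assumed share of borrowed counts at each step
    iterations  -- number of propagation steps
    """
    current = [dict(c) for c in counts]
    for _ in range(iterations):
        updated = []
        for i, original in enumerate(counts):
            # Keep a (1 - alpha) share of the document's own raw counts.
            new = {term: (1 - alpha) * c for term, c in original.items()}
            # Borrow an alpha share from neighbors' current (already smoothed) counts.
            for j, prob in enumerate(transition[i]):
                if prob == 0.0:
                    continue
                for term, c in current[j].items():
                    new[term] = new.get(term, 0.0) + alpha * prob * c
            updated.append(new)
        current = updated
    return current


# Toy chain graph: doc 0 <-> doc 1 <-> doc 2 (hypothetical data).
counts = [
    {"graph": 2},
    {"graph": 1, "model": 1},
    {"model": 1, "pagerank": 3},
]
transition = [
    [0.0, 1.0, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 1.0, 0.0],
]

one_step = propagate_counts(counts, transition, iterations=1)
two_steps = propagate_counts(counts, transition, iterations=2)
```

In this toy chain, the term "pagerank" occurs only in document 2. After one step it reaches document 1 but not document 0; after two steps document 0 also receives a fractional count for it, which is the sense in which iteration lets a document be smoothed with remotely related documents.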
