Parsimonious translation models for information retrieval

In the KL divergence framework, the extended language modeling approach faces the critical problem of estimating a query model, the probabilistic model that encodes the user's information need. For query expansion in initial retrieval, the translation model has been proposed as a way to incorporate term co-occurrence statistics. However, the translation model is difficult to apply, because the term co-occurrence statistics must be constructed offline. For a large collection in particular, constructing such a large matrix of term co-occurrence statistics is prohibitively expensive in both time and space. In addition, reliable retrieval performance cannot be guaranteed, because the translation model may include noisy non-topical terms from documents. To resolve these problems, this paper investigates an effective method for constructing co-occurrence statistics and eliminating noisy terms by employing a parsimonious translation model. The parsimonious translation model is a compact version of the translation model that reduces the number of terms with non-zero probabilities by eliminating non-topical terms in documents. Through experiments on seven different test collections, we show that the query model estimated from the parsimonious translation model significantly outperforms not only the baseline language modeling approach but also the non-parsimonious models.
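To illustrate how parsimony reduces the number of terms with non-zero probability, the sketch below applies the EM re-estimation commonly used for parsimonious language models to a single toy document. It is not the paper's implementation: the function name `parsimonious_model`, the mixture weight `lam`, the pruning `threshold`, and the toy collection probabilities are all assumptions chosen for illustration.

```python
from collections import Counter


def parsimonious_model(doc_tokens, collection_prob, lam=0.1,
                       threshold=1e-3, iterations=20):
    """Estimate a pruned document model with EM (illustrative sketch).

    lam is the assumed weight of the document model in the mixture
    lam * P(t|D) + (1 - lam) * P(t|C).
    """
    tf = Counter(doc_tokens)
    total = sum(tf.values())
    # Start from the maximum likelihood document model.
    p_doc = {t: c / total for t, c in tf.items()}

    for _ in range(iterations):
        # E-step: expected count of each term explained by the document
        # model rather than by the background collection model.
        expected = {}
        for t, p in p_doc.items():
            background = (1 - lam) * collection_prob.get(t, 1e-9)
            expected[t] = tf[t] * (lam * p) / (background + lam * p)

        # M-step: renormalize the expected counts.
        norm = sum(expected.values())
        p_new = {t: e / norm for t, e in expected.items()}

        # Pruning: drop terms below the threshold, then renormalize.
        kept = {t: p for t, p in p_new.items() if p >= threshold}
        if not kept:
            break  # keep the previous model rather than return nothing
        kept_norm = sum(kept.values())
        p_doc = {t: p / kept_norm for t, p in kept.items()}

    return p_doc


# Toy example: "the" is frequent in the document but also in the collection.
doc = ["the", "the", "the", "the", "retrieval", "retrieval",
       "translation", "model"]
collection = {"the": 0.2, "retrieval": 0.001,
              "translation": 0.001, "model": 0.001}
model = parsimonious_model(doc, collection)
print(sorted(model.items(), key=lambda item: -item[1]))
```

In this toy run, the term that the collection model already explains well loses probability mass at each iteration until it falls below the threshold and is removed, so the resulting model stores fewer non-zero entries. A co-occurrence or translation matrix built from such pruned document models would be correspondingly sparser, which is the effect the abstract describes.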
