Estimation of Query Model from Parsimonious Translation Model

The KL divergence framework, an extension of the language modeling approach, faces a critical problem: estimating the query model, the probabilistic model that encodes the user's information need. Term co-occurrence statistics are a natural source for expanding the query model, but at initial retrieval they are difficult to use, because two-dimensional information such as a term co-occurrence matrix must be constructed offline. For large collections in particular, building such a matrix is prohibitively expensive in time and space. This paper proposes an effective method for constructing co-occurrence statistics by employing a parsimonious translation model. The parsimonious translation model is a compact version of the translation model that retains only a very small number of parameters with non-zero probabilities. It greatly reduces the number of terms remaining in each document, so that co-occurrence statistics can be computed in tractable time. In experiments, the query model derived from the parsimonious translation model significantly improves on baseline language modeling performance.
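The idea of pruning a document down to a few terms before counting co-occurrences can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes an EM-style parsimonious estimate at the document level (in the spirit of parsimonious language models), and the function names, the mixing weight `lam`, and the pruning `threshold` are all hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import combinations


def parsimonious_terms(doc_tf, coll_prob, lam=0.1, threshold=0.05, iters=20):
    """Return {term: prob} for the terms kept after EM-style pruning.

    doc_tf:    term -> frequency in the document
    coll_prob: term -> background (collection) probability
    lam, threshold, iters: illustrative values, not taken from the paper
    """
    total = sum(doc_tf.values())
    p_doc = {t: c / total for t, c in doc_tf.items()}
    for _ in range(iters):
        if not p_doc:
            return {}
        # E-step: expected count of each term explained by the document
        # model rather than by the background collection model.
        e = {}
        for t, p in p_doc.items():
            num = lam * p
            e[t] = doc_tf[t] * num / ((1 - lam) * coll_prob[t] + num)
        norm = sum(e.values())
        # M-step: renormalize, then drop terms below the threshold.
        p_doc = {t: v / norm for t, v in e.items() if v / norm >= threshold}
        kept = sum(p_doc.values())
        if kept == 0:
            return {}
        p_doc = {t: v / kept for t, v in p_doc.items()}
    return p_doc


def cooccurrence_counts(docs_terms):
    """Count, for each unordered term pair, the documents containing both."""
    counts = Counter()
    for terms in docs_terms:
        for a, b in combinations(sorted(terms), 2):
            counts[(a, b)] += 1
    return counts
```

Because each document contributes pairs only among its retained terms, the number of pairs per document grows with the square of the pruned vocabulary size rather than the full one, which is the source of the savings described above.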
