An expectation-maximization algorithm for query translation based on pseudo-relevant documents

A query translation method based on the expectation-maximization algorithm is proposed. The method (EM4QT) exploits pseudo-relevant documents in the source and target languages. EM4QT extracts a number of hidden variables for each translation pair. EM4QT employs an expectation-maximization algorithm to estimate the parameters. EM4QT outperforms competitive baselines in cross-language information retrieval.

Query translation in cross-language information retrieval (CLIR) can be done by employing dictionaries, aligned corpora, or machine translators. The scarcity of aligned corpora for various domains in many language pairs increases the importance of dictionary-based CLIR, which motivates us to use only a bilingual dictionary and two independent collections, one in the source language and one in the target language, for query translation. We exploit pseudo-relevant documents for a given query in the source language and pseudo-relevant documents for a translation of the query in the target language, using a proposed expectation-maximization algorithm to improve query translation. The proposed method (called EM4QT) assumes that each target term either is translated from the source pseudo-relevant documents or comes from a noisy collection. Since EM4QT does not directly consider term coherency, defined as the fluency of the target translation, we investigate a crucial question: can EM4QT be improved using either coherency-based methods or token-to-token translation methods? To address this question, we combine different translation models via simple linear interpolation and a proposed divergence minimization method. Evaluations over four CLEF collections in Persian, French, Spanish, and German indicate that EM4QT significantly outperforms competitive baselines on all the collections. Our experiments also reveal that, since EM4QT indirectly considers term coherency, combining the method with coherency-based models does not significantly improve retrieval performance. On the other hand, the query-by-query results support the view that EM4QT usually gives a relatively high weight to one translation; combining it with the proposed token-to-token translation model, which is obtained by running EM4QT for each query term separately, mitigates this effect and yields better results for many queries. A comparison with a competitive word-embedding baseline shows the superiority of the proposed model.
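The abstract does not give the EM4QT update equations, so the following is only a minimal sketch of the general idea it describes: each observed target term is assumed to come either from a topical model or from a fixed noisy collection model, and EM re-estimates the topical model. It uses a generic two-component mixture, not the authors' formulation. The function names `em_mixture` and `interpolate`, the parameters `noise_weight` and `alpha`, and all toy counts are assumptions made for illustration. The second function shows simple linear interpolation of two translation models.

```python
def em_mixture(term_counts, background, noise_weight=0.5, iterations=50):
    """Estimate p(w | theta) from term counts in pseudo-relevant documents.

    Assumption: each term occurrence is drawn from theta with probability
    (1 - noise_weight), or from the fixed background (collection) model
    with probability noise_weight.
    """
    vocab = list(term_counts)
    total = sum(term_counts.values())
    # Initialise with the maximum-likelihood estimate.
    theta = {w: term_counts[w] / total for w in vocab}
    for _ in range(iterations):
        # E-step: posterior probability that w was generated by theta.
        posterior = {}
        for w in vocab:
            topic = (1.0 - noise_weight) * theta[w]
            noise = noise_weight * max(background.get(w, 0.0), 1e-9)
            posterior[w] = topic / (topic + noise)
        # M-step: renormalise the expected counts attributed to theta.
        expected = {w: term_counts[w] * posterior[w] for w in vocab}
        norm = sum(expected.values())
        theta = {w: expected[w] / norm for w in vocab}
    return theta


def interpolate(model_a, model_b, alpha):
    """Linearly combine two term distributions: alpha * a + (1 - alpha) * b."""
    terms = set(model_a) | set(model_b)
    return {
        t: alpha * model_a.get(t, 0.0) + (1.0 - alpha) * model_b.get(t, 0.0)
        for t in terms
    }


if __name__ == "__main__":
    # Toy counts from hypothetical target-language feedback documents.
    counts = {"bank": 10, "river": 5, "the": 20}
    collection = {"the": 0.5, "bank": 0.01, "river": 0.01}
    topical = em_mixture(counts, collection)
    # A hypothetical token-to-token model to combine with the EM estimate.
    token_model = {"bank": 0.6, "shore": 0.4}
    combined = interpolate(topical, token_model, alpha=0.7)
    for term, weight in sorted(combined.items(), key=lambda kv: -kv[1]):
        print(f"{term}\t{weight:.3f}")
```

In this sketch, a term that is frequent in the collection model (here "the") loses weight during EM, while more specific terms gain it. Interpolating with a second model spreads weight back over candidates the first model underweights, which is the kind of effect the abstract attributes to combining EM4QT with the token-to-token model.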
