Using Word Embeddings for Query Translation for Hindi to English Cross Language Information Retrieval

Cross-Language Information Retrieval (CLIR) has become an important problem in recent years due to the growth of multilingual content on the Web. One standard method is to translate the query from the source language to the target language. In this paper, we propose an approach based on word embeddings, which capture the contextual clues for a word in the source language and yield as translations those words that occur in similar contexts in the target language. Once we obtain word embeddings for the source and target languages, we learn a projection from source to target word embeddings using a dictionary of word translation pairs. We then propose various methods of query translation and aggregation. The advantage of this approach is that it does not require aligned corpora, which are difficult to obtain for resource-scarce languages; a dictionary of word translation pairs is enough to train the word vectors for translation. We experiment with the Forum for Information Retrieval and Evaluation (FIRE) 2008 and 2012 datasets for Hindi to English CLIR. The proposed word embedding based approach outperforms the basic dictionary based approach by 70%, and when the word embeddings are combined with the dictionary, the hybrid approach beats the baseline dictionary based method by 77%. When combined with translations obtained from Google Translate and the dictionary, it outperforms the English monolingual baseline by 15%.
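The projection step described in the abstract can be illustrated with a minimal sketch. The abstract does not state the form of the projection, so the code assumes a linear map fitted by least squares over the dictionary pairs, with translations chosen by cosine similarity to the projected vector. The three-dimensional vectors, the transliterated word lists, and the names `learn_projection`, `translate`, and `translate_query` are all hypothetical and serve only as illustration; they are not the embeddings, data, or aggregation methods used in the paper.

```python
import numpy as np


def learn_projection(src_vecs, tgt_vecs):
    """Fit W minimising ||src_vecs @ W - tgt_vecs|| (assumed linear projection)."""
    W, *_ = np.linalg.lstsq(src_vecs, tgt_vecs, rcond=None)
    return W


def translate(word, src_emb, tgt_emb, W, k=2):
    """Return the k target words most cosine-similar to the projected source vector."""
    projected = src_emb[word] @ W
    words = list(tgt_emb)
    matrix = np.stack([tgt_emb[w] for w in words])
    sims = matrix @ projected / (
        np.linalg.norm(matrix, axis=1) * np.linalg.norm(projected)
    )
    top = np.argsort(-sims)[:k]
    return [(words[i], float(sims[i])) for i in top]


def translate_query(query_terms, src_emb, tgt_emb, W, k=2):
    """Translate each query term independently and pool the candidates
    (one simple pooling choice, assumed here for illustration)."""
    return [t for term in query_terms for t, _ in translate(term, src_emb, tgt_emb, W, k)]


# Toy source-language vectors (invented values, transliterated Hindi keys).
src_emb = {
    "paani": np.array([1.0, 0.2, 0.0]),
    "kitab": np.array([0.0, 1.0, 0.3]),
    "ghar": np.array([0.2, 0.0, 1.0]),
    "sadak": np.array([0.5, 0.5, 0.1]),
    "nadi": np.array([0.9, 0.1, 0.3]),  # held out of the dictionary
}

# Toy target space: the source space with its axes permuted.
rotation = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
gloss = {"paani": "water", "kitab": "book", "ghar": "house",
         "sadak": "road", "nadi": "river"}
tgt_emb = {gloss[w]: v @ rotation for w, v in src_emb.items()}

# Bilingual dictionary used as supervision; "nadi" is deliberately excluded.
dictionary = [("paani", "water"), ("kitab", "book"),
              ("ghar", "house"), ("sadak", "road")]
X = np.stack([src_emb[s] for s, _ in dictionary])
Y = np.stack([tgt_emb[t] for _, t in dictionary])
W = learn_projection(X, Y)

print(translate("nadi", src_emb, tgt_emb, W))
print(translate_query(["nadi", "ghar"], src_emb, tgt_emb, W, k=1))
```

In this toy setting the target space is an exact linear transform of the source space, so the fitted map generalises to the held-out word. With real embeddings trained separately on non-aligned corpora the fit is only approximate, which is one reason to keep several candidates per query term rather than a single best match.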
