The Challenges of Optimizing Machine Translation for Low Resource Cross-Language Information Retrieval

When performing cross-language information retrieval (CLIR) for low-resource languages, a common approach is to retrieve over the output of machine translation (MT). However, there is no established guidance on how to optimize the resulting MT-IR system. In this paper, we examine the relationship between the performance of MT systems and both neural and term frequency-based IR models to identify how CLIR performance can best be predicted from MT quality. We explore performance across varying amounts of MT training data, varying numbers of byte pair encoding (BPE) merge operations, and two IR collections and retrieval models. We find that the choice of IR collection can substantially affect the predictive power of MT tuning decisions and evaluation, potentially introducing dissociations between MT-only and overall CLIR performance.
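To make the term frequency-based side of such an MT-IR cascade concrete, the sketch below ranks documents that are assumed to have already been machine-translated into the query language, using a plain BM25 scorer. This is an illustrative implementation only: the whitespace tokenization and the parameter values `k1=1.5`, `b=0.75` are common defaults, not the paper's exact setup.

```python
import math
from collections import Counter

def bm25_rank(query, translated_docs, k1=1.5, b=0.75):
    """Rank MT output documents for a query with BM25.

    Returns document indices sorted from most to least relevant.
    """
    docs = [doc.lower().split() for doc in translated_docs]
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    def idf(term):
        return math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))

    scores = []
    for d in docs:
        tf = Counter(d)
        # Sum BM25 term contributions over the query terms.
        score = sum(
            idf(t) * tf[t] * (k1 + 1)
            / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
            for t in query.lower().split()
        )
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: -scores[i])
```

In a full CLIR pipeline, `translated_docs` would be the MT system's output, so translation errors (e.g. a key term rendered inconsistently) directly change term frequencies and hence BM25 scores, which is one mechanism by which MT quality and retrieval effectiveness can dissociate.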
