A Comparison of Methods for Identifying the Translation of Words in a Comparable Corpus: Recipes and Limits

Identifying translations in comparable corpora is a challenge that has long attracted researchers. It has uses in several areas, including Machine Translation and Cross-lingual Information Retrieval. In this study we compare three state-of-the-art approaches to this task: the so-called context-based projection method, the projection of monolingual word embeddings, and a method dedicated to identifying translations of rare words. We carefully explore the hyper-parameters of each method and measure their impact on the task of identifying the French translations of English words in Wikipedia. Contrary to standard practice, we designed a test case in which we do not resort to heuristics to pre-select the target vocabulary among which to find translations, thereby pushing each method to its limit. We show that all the approaches we tested have a clear bias toward frequent words. In fact, the best approach we tested could identify the translation of a third of a set of frequent test words, while it could only translate around 10% of rare words.
