Paper information - A Comparison of Methods for Identifying the Translation of Words in a Comparable Corpus: Recipes and Limits

Identifying translations in comparable corpora is a challenge that has attracted many researchers for a long time. It has applications in several areas, including Machine Translation and Cross-lingual Information Retrieval. In this study we compare three state-of-the-art approaches to this task: the so-called context-based projection method, the projection of monolingual word embeddings, and a method dedicated to identifying translations of rare words. We carefully explore the hyper-parameters of each method and measure their impact on the task of identifying the French translation of English words in Wikipedia. Contrary to standard practice, we designed a test case in which we do not resort to heuristics to pre-select the target vocabulary among which to find translations, thereby pushing each method to its limit. We show that all the approaches we tested have a clear bias toward frequent words. In fact, the best approach we tested could identify the translation of a third of a set of frequent test words, while it could only translate around 10% of rare words.
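
To make the first of the three approaches more concrete, the following is a minimal sketch of the general idea behind context-based projection: build a co-occurrence vector for a source word, map its dimensions into the target language through a seed bilingual lexicon, and rank target words by the similarity of their own context vectors. It is not the authors' implementation. The function names, the toy sentences, the seed lexicon, the window size, and the use of raw counts with cosine similarity are all assumptions made for illustration; practical systems typically replace raw counts with an association measure.

```python
import math
from collections import Counter


def context_vector(word, sentences, window=2):
    """Count the words co-occurring with `word` within +/- `window` tokens."""
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != word:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts


def project(vector, seed_lexicon):
    """Map source context words to target words; drop words missing from the lexicon."""
    projected = Counter()
    for src_word, count in vector.items():
        if src_word in seed_lexicon:
            projected[seed_lexicon[src_word]] += count
    return projected


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_translations(word, src_sents, tgt_sents, seed_lexicon, window=2):
    """Return (score, candidate) pairs over the whole target vocabulary, best first."""
    projected = project(context_vector(word, src_sents, window), seed_lexicon)
    candidates = {tok for sent in tgt_sents for tok in sent}
    scored = [
        (cosine(projected, context_vector(c, tgt_sents, window)), c)
        for c in candidates
    ]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical toy data: pre-tokenized sentences with function words mostly removed.
SRC = [["the", "cat", "drinks", "milk"], ["the", "dog", "eats", "meat"]]
TGT = [["le", "chat", "boit", "lait"], ["le", "chien", "mange", "viande"]]
SEED = {"the": "le", "drinks": "boit", "milk": "lait", "eats": "mange", "meat": "viande"}

if __name__ == "__main__":
    for score, candidate in rank_translations("cat", SRC, TGT, SEED)[:3]:
        print(f"{candidate}\t{score:.3f}")
```

In this sketch every target word is a candidate, with no pre-selection of the target vocabulary, which mirrors the test setting described in the abstract. It also hints at why rare words are harder: a word seen only a few times yields a sparse context vector, so few of its dimensions survive projection through the seed lexicon.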

Philippe Langlais | Laurent Jakubina
