Unsupervised Extraction of Partial Translations for Neural Machine Translation

In neural machine translation (NMT), monolingual data are usually exploited through so-called back-translation: sentences in the target language are translated into the source language to synthesize new parallel data. While this method provides more training data and thus better models the target language, on the source side it only exploits translations that the NMT system can already generate with a model trained on existing parallel data. In this work, we assume that new translation knowledge can be extracted from monolingual data without relying on existing parallel data at all. We propose a new algorithm for extracting from monolingual data what we call partial translations: pairs of source and target sentences that contain sequences of tokens that are translations of each other. Our algorithm is fully unsupervised and takes only source and target monolingual data as input. Our empirical evaluation shows that these partial translations can be combined with back-translation to further improve NMT models. Furthermore, while partial translations are particularly useful for low-resource language pairs, they can also be successfully exploited in resource-rich scenarios to improve translation quality.
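The core idea of partial-translation extraction can be illustrated with a minimal sketch. The paper does not publish its algorithm here, so the following is only an assumed simplification: it uses hypothetical cross-lingual word embeddings (in practice learnable without parallel data, e.g. by mapping monolingual embedding spaces) and searches a pair of sentences for the longest contiguous token spans whose tokens align above a cosine-similarity threshold. All vocabulary, vectors, and the monotone matching strategy below are toy assumptions, not the authors' method.

```python
import numpy as np

# Toy "cross-lingual" word embeddings in a shared 2-D space.
# Real systems would learn these from monolingual corpora alone;
# every vector here is a hypothetical stand-in.
EMB = {
    # source-language tokens
    "red":   np.array([0.0, 1.0]),
    "sky":   np.array([0.7, 0.7]),
    "house": np.array([1.0, 0.0]),
    # target-language tokens, mapped into the same space
    "rouge": np.array([0.05, 0.95]),
    "ciel":  np.array([0.65, 0.75]),
    "mer":   np.array([-1.0, 0.2]),
}

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def longest_translated_span(src_tokens, tgt_tokens, threshold=0.9):
    """Return the longest pair of contiguous token spans whose tokens
    align one-to-one (monotonically) above the similarity threshold.
    A non-empty result marks the sentence pair as a partial translation."""
    best = ([], [])
    for i in range(len(src_tokens)):
        for j in range(len(tgt_tokens)):
            k = 0
            while (i + k < len(src_tokens) and j + k < len(tgt_tokens)
                   and src_tokens[i + k] in EMB and tgt_tokens[j + k] in EMB
                   and cos(EMB[src_tokens[i + k]],
                           EMB[tgt_tokens[j + k]]) >= threshold):
                k += 1
            if k > len(best[0]):
                best = (src_tokens[i:i + k], tgt_tokens[j:j + k])
    return best

# Two non-parallel sentences that nevertheless share a translated fragment.
src = ["the", "red", "sky", "house"]
tgt = ["le", "rouge", "ciel", "mer"]
print(longest_translated_span(src, tgt))
```

Note that this toy matcher only handles monotone word order; handling reordering between languages would require a more flexible alignment than this contiguous-span scan.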
