Unsupervised Statistical Machine Translation

While modern machine translation relies on large parallel corpora, a recent line of work has managed to train Neural Machine Translation (NMT) systems from monolingual corpora only (Artetxe et al., 2018c; Lample et al., 2018). Despite the potential of this approach for low-resource settings, existing systems remain far behind their supervised counterparts, limiting their practical interest. In this paper, we propose an alternative approach based on phrase-based Statistical Machine Translation (SMT) that significantly closes the gap with supervised systems. Our method exploits the modular architecture of SMT: we first induce a phrase table from monolingual corpora through cross-lingual embedding mappings, combine it with an n-gram language model, and fine-tune hyperparameters through an unsupervised variant of MERT. In addition, iterative backtranslation improves results further, yielding, for instance, 14.08 and 26.22 BLEU points on WMT 2014 English-German and English-French, respectively, an improvement of more than 7-10 BLEU points over previous unsupervised systems, and closing the gap with supervised SMT (Moses trained on Europarl) to 2-5 BLEU points. Our implementation is available at https://github.com/artetxem/monoses.
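To make the phrase table induction step concrete, the sketch below scores candidate translations with a softmax over cosine similarities between phrase embeddings that have already been mapped into a shared cross-lingual space, following the general idea described in the abstract. This is a minimal sketch, not the paper's exact implementation: the function name, the temperature value, and the top-k pruning are illustrative assumptions.

```python
import numpy as np

def induce_phrase_table(src_vecs, tgt_vecs, tgt_phrases, temperature=0.1, top_k=10):
    """Hypothetical sketch: derive translation probabilities for each source
    phrase via a softmax over cosine similarities to all target phrases.

    src_vecs: (n_src, d) source phrase embeddings, pre-mapped into the shared
              cross-lingual space (e.g., via an offline embedding mapping).
    tgt_vecs: (n_tgt, d) target phrase embeddings in the same space.
    tgt_phrases: list of n_tgt target phrase strings.
    """
    # Normalize rows so that dot products equal cosine similarities.
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)

    table = {}
    for i, vec in enumerate(src):
        sims = tgt @ vec                      # cosine similarity to every target phrase
        scores = np.exp(sims / temperature)   # temperature sharpens the distribution
        probs = scores / scores.sum()         # softmax -> translation probabilities
        best = np.argsort(-probs)[:top_k]     # keep only the top-k candidates
        table[i] = [(tgt_phrases[j], float(probs[j])) for j in best]
    return table
```

Probabilities of this form would then populate the translation features of a Moses-style phrase table, with the n-gram language model, unsupervised tuning, and iterative backtranslation handled by the rest of the pipeline.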

[1] Franz Josef Och et al. Minimum Error Rate Training in Statistical Machine Translation. ACL, 2003.

[2] Philipp Koehn et al. Scalable Modified Kneser-Ney Language Model Estimation. ACL, 2013.

[3] Kevin Knight et al. Deciphering Foreign Language. ACL, 2011.

[4] Meng Zhang et al. Adversarial Training for Unsupervised Bilingual Lexicon Induction. ACL, 2017.

[5] Daniel Marcu et al. Statistical Phrase-Based Translation. NAACL, 2003.

[6] Jeffrey Dean et al. Distributed Representations of Words and Phrases and their Compositionality. NIPS, 2013.

[7] John Cocke et al. A Statistical Approach to Machine Translation. CL, 1990.

[8] Ashish Vaswani et al. Unifying Bayesian Inference and Vector Space Models for Improved Decipherment. ACL, 2015.

[9] Noah A. Smith et al. A Simple, Fast, and Effective Reparameterization of IBM Model 2. NAACL, 2013.

[10] Philipp Koehn et al. Six Challenges for Neural Machine Translation. NMT@ACL, 2017.

[11] Guillaume Lample et al. Word Translation Without Parallel Data. ICLR, 2017.

[12] Kevin Knight et al. Large Scale Decipherment for Out-of-Domain Machine Translation. EMNLP-CoNLL, 2012.

[13] Eneko Agirre et al. Learning bilingual word embeddings with (almost) no bilingual data. ACL, 2017.

[14] Hermann Ney et al. A Systematic Comparison of Various Statistical Alignment Models. CL, 2003.

[15] Hai Zhao et al. A Bilingual Graph-Based Semantic Model for Statistical Machine Translation. IJCAI, 2016.

[16] Guillaume Lample et al. Unsupervised Machine Translation Using Monolingual Corpora Only. ICLR, 2017.

[17] Kai Zhao et al. Learning Translation Models from Monolingual Continuous Representations. NAACL, 2015.

[18] Eneko Agirre et al. Generalizing and Improving Bilingual Word Embedding Mappings with a Multi-Step Framework of Linear Transformations. AAAI, 2018.

[19] Quoc V. Le et al. Exploiting Similarities among Languages for Machine Translation. arXiv, 2013.

[20] Lukasz Kaiser et al. Attention is All you Need. NIPS, 2017.

[21] Wei Chen et al. Unsupervised Neural Machine Translation with Weight Sharing. 2018.

[22] Kevin Knight et al. Dependency-Based Decipherment for Resource-Limited Machine Translation. EMNLP, 2013.

[23] Rico Sennrich et al. Improving Neural Machine Translation Models with Monolingual Data. ACL, 2015.

[24] Meng Zhang et al. Earth Mover’s Distance Minimization for Unsupervised Bilingual Lexicon Induction. EMNLP, 2017.

[25] Eneko Agirre et al. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. ACL, 2018.

[26] Eneko Agirre et al. Unsupervised Neural Machine Translation. ICLR, 2017.