Auto-Encoding Variational Neural Machine Translation

Translation data is often a byproduct of mixing different data sources. This mixing can be intentional, as when combining data from different domains or including back-translated monolingual data, but it often also results from how the bilingual dataset was constructed: a combination of documents independently translated in different translation directions, by different translators, agencies, and so on. Most neural machine translation models do not explicitly account for such variation in their probabilistic model. We address this by proposing a deep generative model that generates source and target sentences jointly from a shared sentence-level latent representation. This latent representation is designed to capture variation in the data distribution, allowing the model to adjust its language and translation components accordingly. We show that the model outperforms a strong conditional neural machine translation baseline in three settings: in-domain training, where the training and test data come from the same domain; mixed-domain training, where we train on a mix of domains and test on each domain separately; and in-domain training that also includes synthetic (noisy) back-translated data. We further extend the model to a semi-supervised setting in order to incorporate target-side monolingual data during training. In doing so, we derive the commonly employed back-translation heuristic as a variational approximation to the posterior over the missing source sentence. This allows the back-translation network to be trained jointly with the rest of the model on a shared objective designed for source-to-target translation, with minimal pre-processing. We find that this approach does not match the performance of the back-translation heuristic, but it does improve over a model trained on bilingual data alone.
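The model described above can be sketched formally as follows. This is a hedged reconstruction from the abstract alone: the exact factorization, the conditioning of the inference networks, and the symbols $x$ (source), $y$ (target), $z$ (sentence-level latent) are assumptions, not the paper's notation.

```latex
% Joint generative model: a shared sentence-level latent variable z
% generates the source x and the target y.
p_\theta(x, y) = \int p(z)\, p_\theta(x \mid z)\, p_\theta(y \mid x, z)\, \mathrm{d}z,
\qquad p(z) = \mathcal{N}(0, I)

% The intractable marginal likelihood is trained via an evidence
% lower bound (ELBO) with an approximate posterior q_\phi(z | x, y):
\mathcal{L}(\theta, \phi) =
\mathbb{E}_{q_\phi(z \mid x, y)}\big[ \log p_\theta(x \mid z)
  + \log p_\theta(y \mid x, z) \big]
- \mathrm{KL}\big( q_\phi(z \mid x, y) \,\big\|\, p(z) \big)

% Semi-supervised case: for a target-only sentence y, the missing
% source x is itself treated as latent, with a variational
% distribution q_\phi(x | y). This q plays the role of the
% back-translation network and yields the bound
\log p_\theta(y) \;\geq\;
\mathbb{E}_{q_\phi(x \mid y)\, q_\phi(z \mid x, y)}
  \left[ \log \frac{p(z)\, p_\theta(x \mid z)\, p_\theta(y \mid x, z)}
                   {q_\phi(x \mid y)\, q_\phi(z \mid x, y)} \right]
```

Under this reading, sampling from $q_\phi(x \mid y)$ to fill in the missing source recovers a back-translation-like step, but here the sampler is trained jointly with the generative model on the shared bound rather than fixed in advance.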
