Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation

Recent studies have revealed a number of pathologies of neural machine translation (NMT) systems. Hypotheses explaining these pathologies mostly suggest that there is something fundamentally wrong with NMT as a model or with its training algorithm, maximum likelihood estimation (MLE). Most of this evidence was gathered using maximum a posteriori (MAP) decoding, a decision rule aimed at identifying the highest-scoring translation, i.e., the mode. We argue that this evidence corroborates the inadequacy of MAP decoding more than it casts doubt on the model or its training algorithm. In this work, we show that translation distributions do reproduce various statistics of the data well, but that beam search strays from such statistics. We show that some of the known pathologies and biases of NMT are due to MAP decoding and not to NMT’s statistical assumptions or to MLE. In particular, we show that the most likely translations under the model accumulate so little probability mass that the mode can be considered essentially arbitrary. We therefore advocate decision rules that take the translation distribution into account holistically. We show that an approximation to minimum Bayes risk (MBR) decoding gives competitive results, confirming that NMT models do capture important aspects of translation well in expectation.
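To make the contrast concrete: MAP decoding seeks the single highest-probability string, whereas MBR decoding seeks the hypothesis with the highest expected utility under the model. Below is a minimal Python sketch of a sampling-based MBR approximation in the spirit of the abstract; `sample_translations` is a hypothetical stand-in for unbiased (ancestral) sampling from the NMT model, and sentence-level BLEU via sacrebleu is used as an illustrative utility — the paper's exact utility functions and hyperparameters may differ.

```python
# Sampling-based approximation to minimum Bayes risk (MBR) decoding.
# MAP decoding seeks argmax_y p(y|x); MBR instead seeks
#   argmax_{h in H} E_{y ~ p(y|x)}[u(h, y)],
# the hypothesis with highest expected utility under the model.
# Here one pool of model samples serves both as the hypothesis
# space H and as a Monte Carlo estimate of the expectation.
from sacrebleu.metrics import BLEU

# Sentence-level BLEU as an illustrative utility u(h, y);
# effective_order avoids zero scores on short sentences.
_bleu = BLEU(effective_order=True)

def utility(hyp: str, ref: str) -> float:
    return _bleu.sentence_score(hyp, [ref]).score

def mbr_decode(source: str, sample_translations, n_samples: int = 30) -> str:
    """Return the sample with the highest estimated expected utility.

    `sample_translations(source, n)` is a hypothetical helper assumed
    to draw `n` unbiased (ancestral) samples from the model's
    translation distribution p(y|x).
    """
    samples = sample_translations(source, n_samples)

    def expected_utility(hyp: str) -> float:
        # Average utility of `hyp` against all samples approximates
        # E_{y ~ p(y|x)}[u(hyp, y)].
        return sum(utility(hyp, y) for y in samples) / len(samples)

    return max(samples, key=expected_utility)
```

Note that this sketch evaluates the utility O(n²) times per source sentence, so in practice the sample size is kept modest or the reference pool is subsampled.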
