Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation

Recent studies have revealed a number of pathologies of neural machine translation (NMT) systems. Hypotheses explaining these pathologies mostly suggest that there is something fundamentally wrong with NMT as a model or with its training algorithm, maximum likelihood estimation (MLE). Most of this evidence was gathered using maximum a posteriori (MAP) decoding, a decision rule aimed at identifying the highest-scoring translation, i.e. the mode, under the model distribution. We argue that the evidence corroborates the inadequacy of MAP decoding more than it casts doubt on the model and its training algorithm. In this work, we criticise NMT models probabilistically, showing that stochastic samples following the model's own generative story do reproduce various statistics of the training data well, whereas beam search outputs stray from these statistics. We show that some of the known pathologies of NMT are due to MAP decoding and not to NMT's statistical assumptions or to MLE. In particular, we show that the most likely translations under the model accumulate so little probability mass that the mode can be considered essentially arbitrary. We therefore advocate for decision rules that take into account statistics gathered from the model distribution holistically. As a proof of concept, we show that a straightforward implementation of minimum Bayes risk (MBR) decoding gives good results, outperforming beam search with as few as 30 samples, confirming that MLE-trained NMT models do capture important aspects of translation well in expectation.
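
To make the sampling-based MBR procedure alluded to above concrete, here is a minimal sketch: draw N unbiased samples from the model and return the candidate whose expected utility against the sample pool is highest. The function names and the unigram-F1 utility are hypothetical stand-ins for illustration only, not the paper's actual implementation, which would use a proper sentence-level MT metric and real model samples.

```python
def unigram_f1(hyp, ref):
    """Toy sentence-level utility (stand-in for a metric like sentence BLEU)."""
    h, r = set(hyp.split()), set(ref.split())
    overlap = len(h & r)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(h), overlap / len(r)
    return 2 * precision * recall / (precision + recall)

def mbr_decode(sample, src, num_samples=30, utility=unigram_f1):
    """Sampling-based minimum Bayes risk decoding (a sketch).

    `sample(src)` is assumed to draw one translation y ~ p(y|x) by
    ancestral sampling from the model; `utility` scores a hypothesis
    against a (pseudo-)reference, higher being better.
    """
    # Unbiased samples following the model's own generative story.
    candidates = [sample(src) for _ in range(num_samples)]

    # Estimate each candidate's expected utility under the model,
    # using the same pool of samples as pseudo-references.
    def expected_utility(hyp):
        return sum(utility(hyp, ref) for ref in candidates) / len(candidates)

    # The MBR translation maximises expected utility (minimises Bayes risk),
    # rather than chasing the mode as MAP decoding does.
    return max(candidates, key=expected_utility)
```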
