High Quality Rather than High Model Probability: Minimum Bayes Risk Decoding with Neural Metrics

Abstract In Neural Machine Translation, it is typically assumed that the sentence with the highest estimated probability should also be the translation with the highest quality as measured by humans. In this work, we question this assumption and show that model estimates and translation quality correlate only weakly. As an alternative inference strategy to beam search, we apply Minimum Bayes Risk (MBR) decoding on unbiased samples to optimize diverse automated metrics of translation quality. Instead of targeting the hypotheses with the highest model probability, MBR decoding extracts the hypotheses with the highest estimated quality. Our experiments show that combining a neural translation model with a neural reference-based metric, BLEURT, yields significant improvements in human evaluations. These improvements are obtained with translations that differ from classical beam-search output: they have much lower model likelihood and are less favored by surface metrics like BLEU.
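
For intuition, the sketch below illustrates sampling-based MBR decoding under simplified assumptions: `candidates` are unbiased samples drawn from the translation model and also serve as pseudo-references, and `utility` is a generic stand-in for a reference-based metric such as BLEURT. This is a minimal illustration, not the authors' implementation.

```python
from typing import Callable, Sequence


def mbr_decode(candidates: Sequence[str],
               utility: Callable[[str, str], float]) -> str:
    """Return the candidate with the highest estimated expected utility.

    Each hypothesis h is scored by averaging utility(h, y) over the other
    samples y, a Monte Carlo approximation of E_{y ~ p(y|x)}[u(h, y)];
    the highest-scoring hypothesis is selected instead of the most
    probable one.
    """
    best_hyp, best_score = candidates[0], float("-inf")
    for i, hyp in enumerate(candidates):
        # Use all other samples as pseudo-references for this hypothesis.
        refs = [y for j, y in enumerate(candidates) if j != i]
        score = sum(utility(hyp, y) for y in refs) / max(len(refs), 1)
        if score > best_score:
            best_hyp, best_score = hyp, score
    return best_hyp


if __name__ == "__main__":
    # Toy utility (token-set overlap) standing in for a learned metric.
    def overlap(hyp: str, ref: str) -> float:
        h, r = set(hyp.split()), set(ref.split())
        return len(h & r) / max(len(h | r), 1)

    samples = [
        "the cat sat on the mat",
        "a cat sat on a mat",
        "the cat is on the mat",
    ]
    print(mbr_decode(samples, overlap))
```

In practice the utility call would invoke a neural metric over all candidate pairs, so the quadratic number of scorer evaluations, rather than the decoding loop itself, dominates the cost.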
