Minimum Bayes Risk Decoding with Neural Metrics of Translation Quality

This work applies Minimum Bayes Risk (MBR) decoding to optimize diverse automated metrics of translation quality. Automatic metrics in machine translation have made tremendous progress recently. In particular, neural metrics, fine-tuned on human ratings (e.g. BLEURT, or COMET) are outperforming surface metrics in terms of correlations to human judgements. Our experiments show that the combination of a neural translation model with a neural referencebased metric, BLEURT, results in significant improvement in automatic and human evaluations. This improvement is obtained with translations different from classical beamsearch output: these translations have much lower likelihood and are less favored by surface metrics like BLEU.

[1]  Alon Lavie,et al.  METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments , 2005, IEEvaluation@ACL.

[2]  Rico Sennrich,et al.  Understanding the Properties of Minimum Bayes Risk Decoding in Neural Machine Translation , 2021, ACL.

[3]  Markus Freitag,et al.  BLEU Might Be Guilty but References Are Not Innocent , 2020, EMNLP.

[4]  Mary Williamson,et al.  Facebook AI’s WMT20 News Translation Task Submission , 2020, WMT.

[5]  Oriol Vinyals,et al.  Machine Translation Decoding beyond Beam Search , 2021, EMNLP.

[6]  Tara N. Sainath,et al.  Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling , 2019, ArXiv.

[7]  Wilker Aziz,et al.  Sampling-Based Minimum Bayes Risk Decoding for Neural Machine Translation , 2021, ArXiv.

[8]  Hoon Kim,et al.  Monte Carlo Statistical Methods , 2000, Technometrics.

[9]  Antonio Toral,et al.  Reassessing Claims of Human Parity and Super-Human Performance in Machine Translation at WMT 2019 , 2020, EAMT.

[10]  Joshua Goodman,et al.  Parsing Algorithms and Metrics , 1996, ACL.

[11]  Taku Kudo,et al.  SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing , 2018, EMNLP.

[12]  Shankar Kumar,et al.  Minimum Bayes-Risk Decoding for Statistical Machine Translation , 2004, NAACL.

[13]  Markus Freitag,et al.  Learning to Evaluate Translation Beyond English: BLEURT Submissions to the WMT Metrics 2020 Shared Task , 2020, WMT@EMNLP.

[14]  Shankar Kumar,et al.  Lattice Minimum Bayes-Risk Decoding for Statistical Machine Translation , 2008, EMNLP.

[15]  Khalil Sima'an On maximizing metrics for syntactic disambiguation , 2003, IWPT.

[16]  Yann Dauphin,et al.  Convolutional Sequence to Sequence Learning , 2017, ICML.

[17]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[18]  Maja Popovic,et al.  chrF: character n-gram F-score for automatic MT evaluation , 2015, WMT@EMNLP.

[19]  A. Burchardt,et al.  Multidimensional Quality Metrics (MQM): A Framework for Declaring and Describing Translation Quality Metrics , 2014 .

[20]  Chi-kiu Lo,et al.  YiSi - a Unified Semantic MT Quality Evaluation and Estimation Metric for Languages with Different Levels of Available Resources , 2019, WMT.

[21]  Taro Watanabe,et al.  Denoising Neural Machine Translation Training with Trusted Data and Online Data Selection , 2018, WMT.

[22]  George Kurian,et al.  Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation , 2016, ArXiv.

[23]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[24]  Matt Post,et al.  A Call for Clarity in Reporting BLEU Scores , 2018, WMT.

[25]  Alon Lavie,et al.  COMET: A Neural Framework for MT Evaluation , 2020, EMNLP.

[26]  Danqi Chen,et al.  of the Association for Computational Linguistics: , 2001 .

[27]  Thibault Sellam,et al.  BLEURT: Learning Robust Metrics for Text Generation , 2020, ACL.

[28]  Chin-Yew Lin,et al.  ORANGE: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation , 2004, COLING.

[29]  Kilian Q. Weinberger,et al.  BERTScore: Evaluating Text Generation with BERT , 2019, ICLR.

[30]  Marc'Aurelio Ranzato,et al.  Analyzing Uncertainty in Neural Machine Translation , 2018, ICML.

[31]  David A. Smith,et al.  Minimum Risk Annealing for Training Log-Linear Models , 2006, ACL.

[32]  Joelle Pineau,et al.  An Actor-Critic Algorithm for Sequence Prediction , 2016, ICLR.

[33]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[34]  Shankar Kumar,et al.  Minimum Bayes-Risk Word Alignments of Bilingual Texts , 2002, EMNLP.

[35]  Marta R. Costa-jussà,et al.  Findings of the 2019 Conference on Machine Translation (WMT19) , 2019, WMT.

[36]  Marc'Aurelio Ranzato,et al.  Classical Structured Prediction Losses for Sequence to Sequence Learning , 2017, NAACL.

[37]  Dan Roth,et al.  A Statistical Analysis of Summarization Evaluation Metrics Using Resampling Methods , 2021, Transactions of the Association for Computational Linguistics.

[38]  J. Berger Statistical Decision Theory and Bayesian Analysis , 1988 .

[39]  Chi-kiu Lo Extended Study on Using Pretrained Language Models and YiSi-1 for Machine Translation Evaluation , 2020, WMT@EMNLP.

[40]  William Byrne,et al.  Minimum bayes-risk techniques in automatic speech recognition and statistical machine translation , 2005 .

[41]  Quoc V. Le,et al.  Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[42]  Markus Freitag,et al.  Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation , 2021, Transactions of the Association for Computational Linguistics.

[43]  Khalil Sima'an,et al.  Fitting Sentence Level Translation Evaluation with Many Dense Features , 2014, EMNLP.

[44]  Vaibhava Goel,et al.  Minimum Bayes-risk automatic speech recognition , 2000, Comput. Speech Lang..

[45]  Antonio Valerio Miceli Barone,et al.  The University of Edinburgh’s English-German and English-Hausa Submissions to the WMT21 News Translation Task , 2021, WMT.

[46]  Alon Lavie,et al.  Are References Really Needed? Unbabel-IST 2021 Submission for the Metrics Shared Task , 2021, WMT.

[47]  Mitch Weintraub,et al.  Explicit word error minimization in n-best list rescoring , 1997, EUROSPEECH.

[48]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[49]  Markus Freitag,et al.  Results of the WMT20 Metrics Shared Task , 2020, WMT.

[50]  Wilker Aziz,et al.  Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation , 2020, COLING.