Minimum Bayes Risk Decoding with Neural Metrics of Translation Quality

This work applies Minimum Bayes Risk (MBR) decoding to optimize diverse automated metrics of translation quality. Automatic metrics in machine translation have made tremendous progress recently. In particular, neural metrics fine-tuned on human ratings (e.g., BLEURT or COMET) outperform surface metrics in terms of correlation with human judgements. Our experiments show that combining a neural translation model with a neural reference-based metric, BLEURT, results in significant improvements in both automatic and human evaluations. These improvements are obtained with translations that differ from classical beam search output: they have much lower likelihood and are less favored by surface metrics like BLEU.
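
The selection step behind MBR decoding can be illustrated with a minimal sketch. It assumes the common sampling-based formulation, in which each candidate is scored by its average utility against the other candidates treated as pseudo-references, and the highest-scoring candidate is returned. The names `mbr_decode` and `unigram_f1` are hypothetical, and the unigram F1 utility is only a toy stand-in for a neural metric such as BLEURT; this is not the authors' implementation.

```python
from collections import Counter
from typing import Callable, Sequence


def unigram_f1(hypothesis: str, reference: str) -> float:
    """Toy utility: unigram F1 overlap. A stand-in for a neural metric, not BLEURT."""
    hyp, ref = Counter(hypothesis.split()), Counter(reference.split())
    overlap = sum((hyp & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def mbr_decode(
    candidates: Sequence[str],
    utility: Callable[[str, str], float] = unigram_f1,
) -> str:
    """Return the candidate with the highest average utility against the others."""
    if not candidates:
        raise ValueError("candidates must not be empty")

    def expected_utility(i: int) -> float:
        # Every other candidate serves as a pseudo-reference (an assumption of this sketch).
        others = [c for j, c in enumerate(candidates) if j != i]
        if not others:
            return 0.0
        return sum(utility(candidates[i], r) for r in others) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


if __name__ == "__main__":
    # Hypothetical model samples for one source sentence.
    samples = [
        "the cat sat on the mat",
        "the cat sat on a mat",
        "a cat sat on the mat",
        "dogs bark loudly",
    ]
    print(mbr_decode(samples))
```

Because the utility is passed in as a function, a learned metric could be substituted for `unigram_f1` without changing the selection logic. Note that the chosen candidate need not be the most likely one under the model, which is consistent with the abstract's observation that MBR outputs can have lower likelihood than beam search output.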
