BLEU Might Be Guilty but References Are Not Innocent

The quality of automatic metrics for machine translation has been increasingly called into question, especially for high-quality systems. This paper demonstrates that, while the choice of metric is important, the nature of the references is also critical. We study different methods of collecting references and compare their value in automated evaluation by reporting correlation with human evaluation for a variety of systems and metrics. Motivated by the finding that typical references exhibit poor diversity, concentrating around translationese language, we develop a paraphrasing task for linguists to perform on existing reference translations, which counteracts this bias. Our method yields higher correlation with human judgment not only for the submissions of WMT 2019 English to German, but also for back-translation- and APE-augmented MT outputs, which have been shown to have low correlation with automatic metrics using standard references. We demonstrate that our methodology improves correlation with all of the modern evaluation metrics we examine, including embedding-based methods. To complete this picture, we show that multi-reference BLEU does not improve correlation for high-quality output, and present an alternative multi-reference formulation that is more effective.
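
The minimal Python sketch below (assuming the sacrebleu and scipy packages; the systems, scores, and references are hypothetical toy data, not the paper's) illustrates the evaluation setup the abstract describes: compute system-level BLEU against standard, paraphrased, and combined (multi-reference) reference sets, then measure how well each variant's scores correlate with human judgments.

    # Sketch of the metric-vs.-human correlation setup (hypothetical toy data).
    import sacrebleu
    from scipy.stats import kendalltau

    # Hypothetical MT outputs (two segments per system) and human adequacy scores.
    system_outputs = {
        "system_a": ["The cat sat on the mat .", "He went home ."],
        "system_b": ["A cat is sitting on the mat .", "He walked home ."],
        "system_c": ["Cat sits mat .", "He goed home ."],
    }
    human_scores = {"system_a": 74.2, "system_b": 76.8, "system_c": 51.3}

    standard_refs = ["The cat sat on the mat .", "He went home ."]
    paraphrased_refs = ["On the mat , a cat was sitting .", "He headed home ."]

    def system_bleu(ref_streams):
        # sacrebleu takes a list of reference streams; passing more than one
        # stream yields multi-reference BLEU.
        return {
            name: sacrebleu.corpus_bleu(hyps, ref_streams).score
            for name, hyps in system_outputs.items()
        }

    for label, refs in [
        ("standard", [standard_refs]),
        ("paraphrased", [paraphrased_refs]),
        ("multi-reference", [standard_refs, paraphrased_refs]),
    ]:
        bleu = system_bleu(refs)
        systems = sorted(bleu)
        tau, _ = kendalltau(
            [bleu[s] for s in systems], [human_scores[s] for s in systems]
        )
        print(f"{label:16s} BLEU={bleu}  Kendall tau vs. human = {tau:.2f}")

In a real evaluation, the correlation would be computed over all submitted systems and full test sets, with the paraphrased stream supplied by the linguist-produced paraphrases described above; the toy data here only shows the mechanics.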
