BLEU Might Be Guilty but References Are Not Innocent

The quality of automatic metrics for machine translation has been increasingly called into question, especially for high-quality systems. This paper demonstrates that, while the choice of metric matters, the nature of the references is also critical. We study different methods of collecting references and compare their value in automated evaluation by reporting correlation with human evaluation for a variety of systems and metrics. Motivated by the finding that typical references exhibit poor diversity and concentrate around translationese language, we develop a paraphrasing task in which linguists rewrite existing reference translations, counteracting this bias. Our method yields higher correlation with human judgment not only for the WMT 2019 English→German submissions, but also for back-translation and APE-augmented MT output, both of which have been shown to correlate poorly with automatic metrics when standard references are used. We show that our methodology improves correlation for all of the modern evaluation metrics we consider, including embedding-based methods. To complete the picture, we find that multi-reference BLEU does not improve correlation for high-quality output, and we present an alternative multi-reference formulation that is more effective.
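As a minimal sketch of the evaluation setup described above (not the paper's own code), the example below scores MT system outputs with BLEU against a standard reference and a paraphrased reference, then reports system-level Pearson correlation with human ratings. It uses sacrebleu and scipy; the file names, system list, and human scores are hypothetical placeholders.

```python
# Illustrative sketch only; file names, systems, and human scores are hypothetical.
import sacrebleu
from scipy.stats import pearsonr

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

# One standard reference and one paraphrased reference for the same source sentences.
standard_ref = read_lines("newstest2019.ende.ref")
paraphrased_ref = read_lines("newstest2019.ende.ref.paraphrased")

# MT outputs and system-level human scores (e.g. averaged direct-assessment ratings).
systems = {name: read_lines(f"{name}.hyp") for name in ["sys_a", "sys_b", "sys_c"]}
human_scores = {"sys_a": 0.21, "sys_b": 0.05, "sys_c": -0.13}

def corpus_bleu(hyps, ref):
    # sacrebleu takes a list of reference streams; one stream = single-reference BLEU.
    return sacrebleu.corpus_bleu(hyps, [ref]).score

for label, ref in [("standard", standard_ref), ("paraphrased", paraphrased_ref)]:
    bleu = [corpus_bleu(hyps, ref) for hyps in systems.values()]
    human = [human_scores[name] for name in systems]
    r, _ = pearsonr(bleu, human)  # system-level correlation between BLEU and human judgment
    print(f"{label} references: Pearson r = {r:.3f}")
```

In practice the correlation would be computed over many more systems (and, for segment-level analysis, over individual sentences), but the structure of the comparison is the same.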
