On The Evaluation of Machine Translation Systems Trained With Back-Translation

Back-translation is a widely used data augmentation technique that leverages target-side monolingual data. However, its effectiveness has been challenged, since automatic metrics such as BLEU show significant improvements only for test examples whose source is itself a translation, i.e., translationese. This is believed to be due to translationese inputs better matching the back-translated training data. In this work, we show that this conjecture is not empirically supported: according to professional human translators, back-translation improves translation quality for both naturally occurring text and translationese. We provide empirical evidence that back-translation is preferred by humans because it produces more fluent outputs. BLEU cannot capture this human preference because, when source sentences are natural text, the references are translationese. We recommend complementing BLEU with a language model score to measure fluency.
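The recommendation above — pair BLEU with a language-model score — can be illustrated with a minimal, self-contained sketch. This is not the paper's evaluation pipeline: in practice one would use sacreBLEU and a strong neural LM, whereas here sentence BLEU is reimplemented directly and fluency is approximated with a toy add-one-smoothed bigram LM; all names, the smoothing choice, and the tiny corpus are illustrative assumptions. The point it demonstrates is that a scrambled hypothesis keeps full unigram overlap with the reference, yet both higher-order BLEU precisions and the LM fluency score penalize it.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sentence_bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: brevity penalty times the geometric mean of
    modified n-gram precisions, with add-one smoothing so that short
    sentences do not collapse to zero."""
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        c_counts = Counter(ngrams(cand, n))
        r_counts = Counter(ngrams(ref, n))
        clipped = sum(min(c, r_counts[g]) for g, c in c_counts.items())
        total = max(sum(c_counts.values()), 1)
        log_prec += math.log((clipped + 1) / (total + 1)) / max_n
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec)


def lm_fluency(sentence, bigram_counts, context_counts, vocab_size):
    """Average per-token log-probability under an add-one-smoothed bigram
    LM trained on monolingual text; higher means more fluent. A crude
    stand-in for the neural LM score the paper recommends."""
    tokens = ["<s>"] + sentence.split()
    logp = sum(
        math.log((bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab_size))
        for prev, cur in zip(tokens, tokens[1:])
    )
    return logp / (len(tokens) - 1)


# Train the toy bigram LM on a tiny monolingual "corpus" (illustrative only).
corpus = ["the cat sat on the mat", "the dog sat on the rug"]
bigram_counts, context_counts, vocab = Counter(), Counter(), set()
for line in corpus:
    toks = ["<s>"] + line.split()
    vocab.update(toks)
    bigram_counts.update(zip(toks, toks[1:]))
    context_counts.update(toks[:-1])

reference = "the cat sat on the mat"
fluent = "the cat sat on the mat"
scrambled = "mat the on sat cat the"  # same unigrams, broken word order

bleu_fluent = sentence_bleu(fluent, reference)
bleu_scrambled = sentence_bleu(scrambled, reference)
lm_fluent = lm_fluency(fluent, bigram_counts, context_counts, len(vocab))
lm_scrambled = lm_fluency(scrambled, bigram_counts, context_counts, len(vocab))
```

With a single reference, both metrics rank the fluent hypothesis above the scrambled one; the abstract's argument is that against translationese references, BLEU alone can fail to reward exactly this kind of fluency gain on natural-text inputs, which is why an LM score is a useful complement.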
