Human-Paraphrased References Improve Neural Machine Translation

Automatic evaluation comparing candidate translations to human-generated paraphrases of reference translations has recently been proposed by Freitag et al. When used in place of the original references, the paraphrased versions produce metric scores that correlate better with human judgment. This effect holds for a variety of automatic metrics, and scoring against paraphrases tends to favor natural formulations over more literal (translationese) ones. In this paper, we compare the results of performing end-to-end system development using standard and paraphrased references. With state-of-the-art English-German NMT components, we show that tuning to paraphrased references produces a system that is significantly better according to human judgment, but 5 BLEU points worse when tested on standard references. Our work confirms the finding that paraphrased references yield metric scores that correlate better with human judgment, and demonstrates for the first time that using these scores for system development can lead to significant improvements.
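To make the dependence of a metric score on the choice of reference concrete, the following is a minimal sketch in Python. It is not the evaluation setup used in the paper: `toy_bleu` is a hypothetical, simplified sentence-level score (clipped unigram and bigram precision with a brevity penalty) rather than standard corpus-level BLEU, and the four sentences are invented for illustration only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def toy_bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU-style score (illustration only).

    Geometric mean of clipped n-gram precisions for n = 1..max_n,
    multiplied by a brevity penalty. Returns 0.0 if any precision is zero.
    """
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0
        log_prec += math.log(matched / total) / max_n
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_prec)


# Invented example sentences (assumptions, not data from the paper).
standard_ref = "the committee has taken the decision to postpone the meeting"
paraphrased_ref = "the panel decided to push the meeting back"
literal_output = "the committee has taken the decision to delay the meeting"
natural_output = "the panel decided to postpone the meeting"

scores = {
    (name, ref_name): toy_bleu(hyp, ref)
    for name, hyp in [("literal", literal_output), ("natural", natural_output)]
    for ref_name, ref in [("standard", standard_ref), ("paraphrased", paraphrased_ref)]
}

for key, value in sorted(scores.items()):
    print(key, round(value, 3))
```

Under this toy score, the literal output ranks higher against the standard reference, while the more natural output ranks higher against the paraphrased reference. This mirrors, on a very small scale, how a system tuned to paraphrased references can lose BLEU on standard references without being worse in human evaluation.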

[1]  Chi-kiu Lo,et al.  YiSi - a Unified Semantic MT Quality Evaluation and Estimation Metric for Languages with Different Levels of Available Resources , 2019, WMT.

[2]  Holger Schwenk,et al.  Optimising Multiple Metrics with MERT , 2011, Prague Bull. Math. Linguistics.

[3]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[4]  Shuly Wintner,et al.  Adapting Translation Models to Translationese Improves SMT , 2012, EACL.

[5]  Aurko Roy,et al.  Unsupervised Paraphrasing without Translation , 2019, ACL.

[6]  Antonio Toral,et al.  The Effect of Translationese in Machine Translation Test Sets , 2019, WMT.

[7]  Antonio Toral,et al.  Reassessing Claims of Human Parity and Super-Human Performance in Machine Translation at WMT 2019 , 2020, EAMT.

[8]  Mona Baker,et al.  Corpus Linguistics and Translation Studies: Implications and Applications , 1993 .

[9]  Gideon Toury,et al.  Descriptive translation studies and beyond , 1995 .

[10]  Cyril Goutte,et al.  Automatic Detection of Translated Text and its Impact on Machine Translation , 2009, MTSUMMIT.

[11]  Tara N. Sainath,et al.  Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling , 2019, ArXiv.

[12]  Kilian Q. Weinberger,et al.  BERTScore: Evaluating Text Generation with BERT , 2019, ICLR.

[13]  Moshe Koppel,et al.  Translationese and Its Dialects , 2011, ACL.

[14]  Philipp Koehn,et al.  Findings of the 2018 Conference on Machine Translation (WMT18) , 2018, WMT.

[15]  Yann Dauphin,et al.  A Convolutional Encoder Model for Neural Machine Translation , 2016, ACL.

[16]  Mirella Lapata,et al.  Paraphrasing Revisited with Neural Machine Translation , 2017, EACL.

[17]  周彬彬,et al.  Interlanguage: forty years later , 2014 .

[18]  Taro Watanabe,et al.  Denoising Neural Machine Translation Training with Trusted Data and Online Data Selection , 2018, WMT.

[19]  Nathan Ng,et al.  Simple and Effective Noisy Channel Modeling for Neural Machine Translation , 2019, EMNLP.

[20]  Ralph Weischedel,et al.  A Study of Translation Error Rate with Targeted Human Annotation , 2005 .

[21]  Nitin Madnani,et al.  Using Paraphrases for Parameter Tuning in Statistical Machine Translation , 2007, WMT@ACL.

[22]  Andy Way,et al.  Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Machine Translation , 2018, WMT.

[23]  Antonio Toral,et al.  A Set of Recommendations for Assessing Human-Machine Parity in Language Translation , 2020, J. Artif. Intell. Res.

[24]  Karin M. Verspoor,et al.  Findings of the 2016 Conference on Machine Translation , 2016, WMT.

[25]  Matt Post,et al.  A Call for Clarity in Reporting BLEU Scores , 2018, WMT.

[26]  Myle Ott,et al.  Facebook FAIR’s WMT19 News Translation Task Submission , 2019, WMT.

[27]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[28]  Lucia Specia,et al.  Reference Bias in Monolingual Machine Translation Evaluation , 2016, ACL.

[29]  Markus Freitag,et al.  Translationese as a Language in “Multilingual” NMT , 2019, ACL.

[30]  Miles Osborne,et al.  Statistical Machine Translation , 2010, Encyclopedia of Machine Learning and Data Mining.

[31]  Yifan He,et al.  Metric and reference factors in minimum error rate training , 2010, Machine Translation.

[32]  Chris Callison-Burch,et al.  Improved Statistical Machine Translation Using Monolingually-Derived Paraphrases , 2009, EMNLP.

[33]  Franz Josef Och,et al.  Minimum Error Rate Training in Statistical Machine Translation , 2003, ACL.

[34]  Markus Freitag,et al.  BLEU Might Be Guilty but References Are Not Innocent , 2020, EMNLP.

[35]  Matthew G. Snover,et al.  A Study of Translation Edit Rate with Targeted Human Annotation , 2006, AMTA.

[36]  Sara Stymne,et al.  Improving Alignment for SMT by Reordering and Augmenting the Training Corpus , 2009, WMT@EACL.

[37]  Markus Freitag,et al.  APE at Scale and Its Implications on MT Evaluation Biases , 2019, WMT.

[38]  Sara Stymne,et al.  The Effect of Translationese on Tuning for Statistical Machine Translation , 2017, NODALIDA.

[39]  Alon Lavie,et al.  METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments , 2005, IEEvaluation@ACL.

[40]  Stefan Riezler,et al.  On Some Pitfalls in Automatic Evaluation and Significance Testing for MT , 2005, IEEvaluation@ACL.

[41]  Pekka Kujamäki,et al.  Translation universals: do they exist? , 2004 .

[42]  Philipp Koehn,et al.  Translationese in Machine Translation Evaluation , 2019, EMNLP.

[43]  Rico Sennrich,et al.  Domain, Translationese and Noise in Synthetic Data for Neural Machine Translation , 2019, ArXiv.

[44]  Timothy Baldwin,et al.  Further Investigation into Reference Bias in Monolingual Evaluation of Machine Translation , 2017, EMNLP.

[45]  Nikhil Buduma,et al.  Fundamentals of deep learning , 2017 .

[46]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[47]  Shuly Wintner,et al.  Language Models for Machine Translation: Original vs. Translated Texts , 2011, CL.