Explicit Representation of the Translation Space: Automatic Paraphrasing for Machine Translation Evaluation

Following previous work on automatic paraphrasing, we assess the feasibility of improving BLEU (Papineni et al., 2002) using state-of-the-art neural paraphrasing techniques to generate additional references. We explore the extent to which diverse paraphrases can adequately cover the space of valid translations and compare to an alternative approach of generating paraphrases constrained by MT outputs. We compare both approaches to human-produced references in terms of diversity and the improvement in BLEU's correlation with human judgements of MT quality. Our experiments on the WMT19 metrics tasks for all into-English language directions show that somewhat surprisingly, the addition of diverse paraphrases, even those produced by humans, leads to only small, inconsistent changes in BLEU's correlation with human judgments, suggesting that BLEU's ability to correctly exploit multiple references is limited

[1]  Mirella Lapata,et al.  Paraphrasing Revisited with Neural Machine Translation , 2017, EACL.

[2]  Huda Khayrallah,et al.  Improved Lexically Constrained Decoding for Translation and Monolingual Rewriting , 2019, NAACL.

[3]  Dan Klein,et al.  Learning Accurate, Compact, and Interpretable Tree Annotation , 2006, ACL.

[4]  Alon Lavie,et al.  Meteor Universal: Language Specific Translation Evaluation for Any Target Language , 2014, WMT@ACL.

[5]  Oladimeji Farri,et al.  Neural Paraphrase Generation with Stacked Residual LSTM Networks , 2016, COLING.

[6]  Christopher D. Manning,et al.  Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks , 2015, ACL.

[7]  Alon Lavie,et al.  METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments , 2005, IEEvaluation@ACL.

[8]  Regina Barzilay,et al.  Paraphrasing for Automatic Evaluation , 2006, NAACL.

[9]  Kyunghyun Cho,et al.  Generating Diverse Translations with Sentence Codes , 2019, ACL.

[10]  Ondrej Bojar,et al.  Results of the WMT19 Metrics Shared Task: Segment-Level and Strong MT Systems Pose Big Challenges , 2019, WMT.

[11]  Luke S. Zettlemoyer,et al.  Adversarial Example Generation with Syntactically Controlled Paraphrase Networks , 2018, NAACL.

[12]  Nitin Madnani,et al.  Using Paraphrases for Parameter Tuning in Statistical Machine Translation , 2007, WMT@ACL.

[13]  S. Lewis,et al.  Regression analysis , 2007, Practical Neurology.

[14]  Claire Gardent,et al.  Generating Syntactic Paraphrases , 2018, EMNLP.

[15]  Eiichiro Sumita,et al.  How Does Automatic Machine Translation Evaluation Correlate with Human Scoring as the Number of Reference Translations Increases? , 2004, LREC.

[16]  Holger Schwenk,et al.  Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond , 2018, Transactions of the Association for Computational Linguistics.

[17]  Matt Post,et al.  The Sockeye Neural Machine Translation Toolkit at AMTA 2018 , 2018, AMTA.

[18]  Matt Post,et al.  Large-Scale, Diverse, Paraphrastic Bitexts via Sampling and Clustering , 2019, CoNLL.

[19]  Alessandro Moschitti,et al.  Making Tree Kernels Practical for Natural Language Learning , 2006, EACL.

[20]  Samy Bengio,et al.  Discrete Autoencoders for Sequence Models , 2018, ArXiv.

[21]  Nitika Mathur,et al.  Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics , 2020, ACL.

[22]  Timothy Baldwin,et al.  Continuous Measurement Scales in Human Evaluation of Machine Translation , 2013, LAW@ACL.

[23]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[24]  N. Draper,et al.  Applied Regression Analysis , 1966 .

[25]  Timothy Baldwin,et al.  Testing for Significance of Increased Correlation with Human Judgment , 2014, EMNLP.

[26]  Alon Lavie,et al.  Meteor 1.3: Automatic Metric for Reliable Optimization and Evaluation of Machine Translation Systems , 2011, WMT@EMNLP.

[27]  Marta R. Costa-jussà,et al.  Findings of the 2019 Conference on Machine Translation (WMT19) , 2019, WMT.

[28]  Matt Post,et al.  Fast Lexically Constrained Decoding with Dynamic Beam Allocation for Neural Machine Translation , 2018, NAACL.

[29]  Mamoru Komachi,et al.  Filtering Pseudo-References by Paraphrasing for Automatic Evaluation of Machine Translation , 2019, WMT.

[30]  Kevin Gimpel,et al.  Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations , 2017, ArXiv.

[31]  Kenneth Heafield,et al.  The Sockeye 2 Neural Machine Translation Toolkit at AMTA 2020 , 2020, AMTA.

[32]  Matt Post,et al.  A Call for Clarity in Reporting BLEU Scores , 2018, WMT.

[33]  Myle Ott,et al.  Understanding Back-Translation at Scale , 2018, EMNLP.

[34]  Markus Freitag,et al.  BLEU Might Be Guilty but References Are Not Innocent , 2020, EMNLP.

[35]  Liang Zhou,et al.  Re-evaluating Machine Translation Results with Paraphrase Support , 2006, EMNLP.

[36]  Richard Nock,et al.  D-PAGE: Diverse Paraphrase Generation , 2018, ArXiv.

[37]  Matt Post,et al.  Sentential Paraphrasing as Black-Box Machine Translation , 2016, NAACL.

[38]  Philipp Koehn,et al.  Re-evaluating the Role of Bleu in Machine Translation Research , 2006, EACL.

[39]  Marcin Junczys-Dowmunt,et al.  Marian: Cost-effective High-Quality Neural Machine Translation in C++ , 2018, NMT@ACL.

[40]  Christian Federmann,et al.  Multilingual Whispers: Generating Paraphrases with Translation , 2019, W-NUT@EMNLP.

[41]  Rebecca Hwa,et al.  The Role of Pseudo References in MT Evaluation , 2008, WMT@ACL.

[42]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[43]  Daniel Marcu,et al.  HyTER: Meaning-Equivalent Semantics for Translation Evaluation , 2012, NAACL.

[44]  Taku Kudo,et al.  Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates , 2018, ACL.

[45]  Taku Kudo,et al.  SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing , 2018, EMNLP.

[46]  Chi-Man Pun,et al.  Adversarial Example Generation , 2019, ArXiv.

[47]  Chris Callison-Burch,et al.  Automated Paraphrase Lattice Creation for HyTER Machine Translation Evaluation , 2018, NAACL.