Automatic Machine Translation Evaluation in Many Languages via Zero-Shot Paraphrasing

We propose using a sequence-to-sequence paraphraser for automatic machine translation evaluation. The paraphraser takes a human reference as input and then force-decodes and scores an MT system output. We propose training the paraphraser as a multilingual NMT system, treating paraphrasing as a zero-shot "language pair" (e.g., Russian to Russian). We call our paraphraser "unbiased" because the mode of our model's output probability is centered around a copy of the input sequence, which in our case represents the best-case scenario, where the MT system output matches a human reference. Our method is simple and intuitive, and our single model (trained in 39 languages) outperforms or statistically ties with all prior metrics on the WMT19 segment-level shared metrics task in all languages, excluding Gujarati, where the model had no training data. We also explore using our model conditioned on the source instead of the reference, and find that it outperforms every quality-estimation-as-a-metric system from the WMT19 shared task on quality estimation by a statistically significant margin in every language pair.
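The force-decode-and-score step described above can be illustrated with a minimal sketch. It is not the authors' implementation: `NextTokenModel`, `force_decode_score`, and `toy_paraphraser` are hypothetical names, the toy model merely stands in for a trained multilingual NMT system, and averaging the token log-probabilities is an assumption made here for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence

EOS = "</s>"

# Hypothetical interface (an assumption for this sketch): given the conditioning
# tokens and the target prefix so far, return log-probabilities for the next token.
NextTokenModel = Callable[[Sequence[str], Sequence[str]], Dict[str, float]]


def force_decode_score(model: NextTokenModel,
                       condition: Sequence[str],
                       hypothesis: Sequence[str]) -> float:
    """Force-decode `hypothesis` given `condition` and return the mean token log-prob.

    Nothing is generated: at each step the model is fed the hypothesis prefix
    and we read off the log-probability it assigns to the next hypothesis token.
    """
    target: List[str] = list(hypothesis) + [EOS]
    total = 0.0
    for i, token in enumerate(target):
        total += model(condition, target[:i])[token]
    # Length normalization is an assumption of this sketch.
    return total / len(target)


def toy_paraphraser(vocab: Sequence[str], copy_prob: float = 0.8) -> NextTokenModel:
    """Toy stand-in for a trained paraphraser (hypothetical).

    It puts `copy_prob` on the input token at the same position and spreads the
    rest uniformly, so its mode is an exact copy of the input. All input tokens
    are assumed to be in `vocab`.
    """
    symbols = sorted(set(vocab) | {EOS})

    def model(condition: Sequence[str], prefix: Sequence[str]) -> Dict[str, float]:
        position = len(prefix)
        favored = condition[position] if position < len(condition) else EOS
        rest = (1.0 - copy_prob) / (len(symbols) - 1)
        return {s: math.log(copy_prob if s == favored else rest) for s in symbols}

    return model


reference = "the cat sat on the mat".split()
exact_match = list(reference)
near_match = "the cat sat on a mat".split()

model = toy_paraphraser(reference + near_match)
print(force_decode_score(model, reference, exact_match))
print(force_decode_score(model, reference, near_match))
```

With this toy model, an MT output identical to the reference receives the highest score and any deviation lowers it, which mirrors the sense in which the paraphraser is described as "unbiased". Conditioning on the source instead of the reference would amount to passing the source tokens as `condition` to a model that supports that language pair.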
