Evaluating Pronominal Anaphora in Machine Translation: An Evaluation Measure and a Test Suite

The ongoing neural revolution in machine translation has made it easier to model larger contexts beyond the sentence level, which can potentially help resolve discourse-level ambiguities such as pronominal anaphora and thus enable better translations. Unfortunately, even when humans judge the resulting improvements to be substantial, traditional automatic evaluation measures such as BLEU barely register them, because only a few words are affected. Specialized evaluation measures are therefore needed. With this aim in mind, we contribute an extensive, targeted dataset that can be used as a test suite for pronoun translation into English, covering multiple source languages and different pronoun errors drawn from real system translations. We further propose an evaluation measure that differentiates good pronoun translations from bad ones, and we conduct a user study to report correlations with human judgments.
