Can Your Context-Aware MT System Pass the DiP Benchmark Tests? Evaluation Benchmarks for Discourse Phenomena in Machine Translation

Although machine translation (MT) systems increasingly incorporate contextual information, evidence that this improves translation quality is sparse, especially for discourse phenomena. Popular metrics like BLEU are not expressive or sensitive enough to capture quality improvements or drops that are minor in size but significant in perception. We introduce first-of-their-kind MT benchmark datasets that aim to track and recognize improvements across four main discourse phenomena: anaphora, lexical consistency, coherence and readability, and discourse connective translation. We also introduce evaluation methods for these tasks and evaluate several baseline MT systems on the curated datasets. Surprisingly, we find that existing context-aware models do not improve discourse-related translations consistently across languages and phenomena.
