Neural versus phrase-based MT quality: An in-depth analysis on English-German and English-French

Abstract

Within the field of statistical machine translation, the neural approach (NMT) is currently advancing the state-of-the-art performance traditionally achieved by phrase-based approaches (PBMT), and is rapidly becoming the dominant technology in machine translation. Indeed, in the last IWSLT and WMT evaluation campaigns on machine translation, NMT outperformed well-established state-of-the-art PBMT systems on many different language pairs. To understand in what respects NMT provides better translation quality than PBMT, we perform a detailed analysis of neural versus phrase-based statistical machine translation outputs, leveraging high-quality post-edits performed by professional translators on the IWSLT data. In this analysis, we focus on two language directions with different characteristics: English–German, known to be particularly hard because of morphological and syntactic differences, and English–French, where PBMT systems typically reach outstanding quality and thus represent a strong competitor for NMT. Our analysis provides useful insights into which linguistic phenomena are best modelled by neural models, such as the reordering of verbs and nouns, while pointing out other aspects that remain to be improved, like the correct translation of proper nouns.
