Are Automatic Metrics Robust and Reliable in Specific Machine Translation Tasks?

The research leading to these results has received funding from the Generalitat Valenciana under grant PROMETEO/2018/004.
