Improving Evaluation of Document-level Machine Translation Quality Estimation

Meaningful conclusions about the relative performance of NLP systems are only possible if the gold standard employed in a given evaluation is both valid and reliable. In this paper, we explore the validity of the human annotations currently employed in the evaluation of document-level quality estimation for machine translation (MT). We demonstrate the degree to which MT system rankings depend on the weights employed in the construction of the gold standard, before proposing direct human assessment as a valid alternative. Experiments show direct assessment (DA) scores for documents to be highly reliable, achieving a correlation above 0.9 in a self-replication experiment, in addition to a substantial estimated cost reduction through quality-controlled crowdsourcing: the original gold standard based on post-edits incurs a 10-20 times greater cost than DA.
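As a rough illustration of the kind of reliability check the abstract describes, the sketch below aggregates crowdsourced DA ratings per document and measures agreement between two independent annotation runs with a Pearson correlation. This is not the paper's implementation: the per-annotator standardization follows the general DA methodology of standardizing raw 0-100 scores within each annotator before averaging, and all function names and data layouts are illustrative assumptions.

```python
# Minimal sketch (assumed data layout, not the paper's code):
# ratings are tuples of (annotator_id, doc_id, raw_score_0_to_100).
from collections import defaultdict
from statistics import mean, pstdev

from scipy.stats import pearsonr


def standardize_by_annotator(ratings):
    """z-score raw DA scores within each annotator to remove individual
    scoring biases (standard DA practice; details here are assumptions)."""
    by_annotator = defaultdict(list)
    for annotator, _, score in ratings:
        by_annotator[annotator].append(score)
    stats = {a: (mean(s), pstdev(s) or 1.0) for a, s in by_annotator.items()}
    return [(a, d, (s - stats[a][0]) / stats[a][1]) for a, d, s in ratings]


def doc_scores(ratings):
    """Average the standardized scores collected for each document."""
    by_doc = defaultdict(list)
    for _, doc_id, z in standardize_by_annotator(ratings):
        by_doc[doc_id].append(z)
    return {d: mean(zs) for d, zs in by_doc.items()}


def self_replication_r(ratings_run1, ratings_run2):
    """Pearson correlation between document-level DA scores from two
    independent annotation runs; the paper reports values above 0.9."""
    s1, s2 = doc_scores(ratings_run1), doc_scores(ratings_run2)
    common = sorted(set(s1) & set(s2))
    r, _ = pearsonr([s1[d] for d in common], [s2[d] for d in common])
    return r
```

In this setup, a self-replication correlation close to 1 indicates that re-collecting DA judgments for the same documents with a fresh crowd would yield essentially the same document ranking.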
