Is Machine Translation Getting Better over Time?

Recent human evaluation of machine translation has focused on relative preference judgments of translation quality, making it difficult to track longitudinal improvements over time. We carry out a large-scale crowd-sourcing experiment to estimate the degree to which state-of-the-art performance in machine translation has increased over the past five years. To facilitate longitudinal evaluation, we move away from relative preference judgments and instead ask human judges to provide direct estimates of the quality of individual translations in isolation from alternate outputs. For seven European language pairs, our evaluation estimates an average 10-point improvement to state-of-the-art machine translation between 2007 and 2012, with Czech-to-English translation standing out as the language pair achieving the most substantial gains. Our method of human evaluation offers an economically feasible and robust means of performing ongoing longitudinal evaluation of machine translation.
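The sketch below illustrates, under stated assumptions, how crowd-sourced direct-assessment scores of this kind might be aggregated to compare system output across years. The example data, the 0-100 scale, and the per-judge standardization step are illustrative assumptions and not details given in the abstract.

```python
# A minimal sketch (not the paper's exact procedure) of aggregating
# crowd-sourced direct quality estimates to track change across years.
from collections import defaultdict
from statistics import mean, pstdev

# (judge_id, year, raw_score) tuples; each raw_score is an assumed 0-100
# quality estimate of a single translation, judged in isolation.
ratings = [
    ("j1", 2007, 55.0), ("j1", 2012, 70.0),
    ("j2", 2007, 40.0), ("j2", 2012, 52.0),
    ("j3", 2007, 63.0), ("j3", 2012, 68.0),
]

# Standardize each judge's scores to damp individual scoring biases
# (an assumption here, not a step stated in the abstract).
by_judge = defaultdict(list)
for judge, _, score in ratings:
    by_judge[judge].append(score)
judge_stats = {j: (mean(s), pstdev(s) or 1.0) for j, s in by_judge.items()}

# Average the standardized scores per year and report the trend.
by_year = defaultdict(list)
for judge, year, score in ratings:
    mu, sigma = judge_stats[judge]
    by_year[year].append((score - mu) / sigma)

for year in sorted(by_year):
    print(year, round(mean(by_year[year]), 3))
```

A longitudinal comparison then reduces to comparing the per-year averages; with real data one would also attach confidence intervals to the difference before claiming an improvement.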
