Results of the WMT18 Metrics Shared Task: Both characters and embeddings achieve good performance

This paper presents the results of the WMT18 Metrics Shared Task. We asked participants of this task to score the outputs of the MT systems involved in the WMT18 News Translation Task with automatic metrics. We collected scores of 10 metrics from 8 research groups. In addition, we computed scores of 8 standard metrics (BLEU, SentBLEU, chrF, NIST, WER, PER, TER, and CDER) as baselines. The collected scores were evaluated in terms of system-level correlation (how well each metric’s scores correlate with the WMT18 official manual ranking of systems) and in terms of segment-level correlation (how often a metric agrees with humans in judging the quality of a particular sentence relative to alternative outputs). This year, we employ a single kind of manual evaluation: direct assessment (DA).
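A minimal sketch of the two kinds of correlation the abstract describes, not the task's official scoring code: system-level evaluation typically compares one human DA score and one metric score per MT system with Pearson's r, while segment-level evaluation measures how often the metric orders pairs of translations the same way as humans, in a Kendall's-tau-like fashion. The data below is invented for illustration, and the exact WMT18 formulation (e.g., DA relative ranking) differs in detail.

```python
# Illustrative only: hypothetical scores, standard SciPy correlation functions.
from scipy.stats import pearsonr, kendalltau

# System level: one human DA score and one metric score per MT system.
human_sys = [0.71, 0.64, 0.58, 0.49]    # hypothetical DA scores for 4 systems
metric_sys = [34.2, 31.8, 30.1, 27.5]   # hypothetical metric scores (e.g., BLEU-like)
r, _ = pearsonr(human_sys, metric_sys)
print(f"system-level Pearson r = {r:.3f}")

# Segment level: per-sentence human and metric scores; agreement on pairwise
# orderings is summarized by a Kendall's-tau-style statistic.
human_seg = [80, 65, 72, 40, 55]              # hypothetical per-segment DA scores
metric_seg = [0.62, 0.50, 0.58, 0.35, 0.44]   # hypothetical per-segment metric scores
tau, _ = kendalltau(human_seg, metric_seg)
print(f"segment-level Kendall tau = {tau:.3f}")
```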
