Results of the WMT17 Metrics Shared Task
This paper presents the results of the
WMT17 Metrics Shared Task. We asked
participants of this task to score the outputs of the MT systems involved in the
WMT17 news translation task and Neural MT training task. We collected scores
of 14 metrics from 8 research groups. In
addition, we computed scores of
7 standard metrics (BLEU, SentBLEU,
NIST, WER, PER, TER and CDER) as
baselines. The collected scores were evaluated in terms of system-level correlation
(how well each metric’s scores correlate
with the official WMT17 manual ranking of
systems) and in terms of segment-level
correlation (how often a metric agrees with
humans in judging the quality of a particular sentence).
This year, we build upon two types of
manual judgements: direct assessment
(DA) and HUME manual semantic judgements.
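
To make the two evaluation levels described above concrete, here is a minimal sketch. It assumes Pearson correlation for the system level and a Kendall's-tau-like count of pairwise agreements for the segment level. The abstract does not spell out the exact formulations, so both choices, along with the function names and the toy scores, are illustrative assumptions and not the task's official scripts.

```python
from itertools import combinations
from math import sqrt


def pearson(metric_scores, human_scores):
    """System-level sketch: Pearson correlation between one metric score
    per MT system and the corresponding human score per system."""
    n = len(metric_scores)
    if n != len(human_scores) or n < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    mean_m = sum(metric_scores) / n
    mean_h = sum(human_scores) / n
    cov = sum((m - mean_m) * (h - mean_h)
              for m, h in zip(metric_scores, human_scores))
    var_m = sum((m - mean_m) ** 2 for m in metric_scores)
    var_h = sum((h - mean_h) ** 2 for h in human_scores)
    if var_m == 0 or var_h == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_m * var_h)


def pairwise_agreement(metric_scores, human_scores):
    """Segment-level sketch (assumed formulation): over all pairs of
    segments that humans did not score equally, count how often the metric
    orders the pair the same way, and return
    (concordant - discordant) / (concordant + discordant)."""
    if len(metric_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    concordant = discordant = 0
    for i, j in combinations(range(len(metric_scores)), 2):
        human_diff = human_scores[i] - human_scores[j]
        metric_diff = metric_scores[i] - metric_scores[j]
        if human_diff == 0:
            continue  # humans see no difference: pair is not informative
        if human_diff * metric_diff > 0:
            concordant += 1
        else:
            discordant += 1  # metric ties count against the metric here
    total = concordant + discordant
    if total == 0:
        raise ValueError("no informative segment pairs")
    return (concordant - discordant) / total


if __name__ == "__main__":
    # Toy, made-up scores for three hypothetical systems / segments.
    print(pearson([0.21, 0.25, 0.30], [55.0, 61.0, 70.0]))
    print(pairwise_agreement([0.1, 0.9, 0.5], [1.0, 2.0, 3.0]))
```

Both functions return values in [-1, 1], where 1 means the metric ranks systems or segments exactly as the human judgements do. Counting metric ties as disagreements is one of several possible conventions.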