Blend: a Novel Combined MT Metric Based on Direct Assessment — CASICT-DCU submission to WMT17 Metrics Task

Existing metrics for evaluating the quality of machine translation hypotheses take different perspectives into account. DPMFcomb, a metric combining the merits of a range of individual metrics, achieved the best performance for evaluation of to-English language pairs in the previous two years of the WMT Metrics Shared Task. This year, we submit a novel combined metric, Blend, to the WMT17 Metrics task. Compared to DPMFcomb, Blend includes the following adaptations: i) we use Direct Assessment (DA) human evaluation to guide the training process, with a vast reduction in the amount of required training data, while still achieving improved performance when evaluated on the WMT16 to-English language pairs; ii) we carry out experiments to explore the contribution of the metrics incorporated in Blend, in order to find a trade-off between performance and efficiency.

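The abstract outlines the approach without implementation detail, so the following is a minimal, hypothetical sketch of a Blend-style combination: the scores of several component metrics serve as the feature vector for an SVM regressor trained against segment-level DA scores. The feature layout, hyperparameters, and the helper names train_blend and blend_score are illustrative assumptions rather than the authors' released code, and scikit-learn's SVR is used here as a convenient stand-in for an epsilon-SVR implementation.

    # Sketch of a Blend-style combined metric (assumed design, not the official code).
    # Each segment is represented by the scores of its component metrics
    # (e.g. BLEU, METEOR, CharacTer, ...), and an SVM regressor is fit to
    # predict the segment-level Direct Assessment (DA) human score.

    import numpy as np
    from sklearn.svm import SVR  # stand-in for an epsilon-SVR as in LIBSVM

    def train_blend(metric_scores, da_scores):
        """Fit the combined metric.

        metric_scores: (n_segments, n_metrics) array, one column per component metric.
        da_scores: (n_segments,) array of human DA scores used as regression targets.
        """
        model = SVR(kernel="rbf", C=1.0, epsilon=0.1)  # hyperparameters are illustrative
        model.fit(metric_scores, da_scores)
        return model

    def blend_score(model, metric_scores):
        """Score new hypotheses: the regressor's prediction is the combined metric."""
        return model.predict(metric_scores)

    if __name__ == "__main__":
        # Toy example with random numbers standing in for real metric and DA scores.
        rng = np.random.default_rng(0)
        X_train = rng.random((500, 25))   # e.g. 25 component metric scores per segment
        y_train = rng.random(500)         # segment-level DA scores scaled to [0, 1]
        model = train_blend(X_train, y_train)
        print(blend_score(model, rng.random((3, 25))))

The point of the reduction in training data is that DA scores provide an absolute regression target per segment, so the model can be trained directly on scored segments instead of on the much larger sets of pairwise ranking judgments that a RankSVM-style combination requires.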