BEER 1.1: ILLC UvA submission to metrics and tuning task

We describe the submissions of ILLC UvA to the metrics and tuning tasks at WMT15. Both submissions are based on the BEER evaluation metric originally presented at WMT14 (Stanojevic and Sima’an, 2014a). The main changes introduced this year are: (i) extending the sentence-level metric, which is trained with learning-to-rank, to the corpus level (while keeping it decomposable to the sentence level), (ii) incorporating syntactic ingredients based on dependency trees, and (iii) a technique for finding parameters of BEER that avoid “gaming of the metric” during tuning.
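As a rough illustration of point (i), one way a corpus-level score can stay decomposable to the sentence level is to define it as the mean of the sentence-level scores. The sketch below assumes that definition and a linear sentence-level model; the abstract does not state either, and the names `sentence_score`, `corpus_score`, the feature values, and the weights are all hypothetical.

```python
from statistics import mean
from typing import Sequence


def sentence_score(features: Sequence[float], weights: Sequence[float]) -> float:
    """Hypothetical linear sentence-level score: dot product of features and weights."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(f * w for f, w in zip(features, weights))


def corpus_score(sentence_scores: Sequence[float]) -> float:
    """Assumed corpus-level score: arithmetic mean of the sentence-level scores.

    Because it is a plain average, each sentence's contribution can be read
    off directly, which is what "decomposable" is taken to mean here.
    """
    if not sentence_scores:
        raise ValueError("need at least one sentence score")
    return mean(sentence_scores)


# Illustrative weights and per-sentence feature vectors (made-up numbers).
weights = [0.5, 0.25]
corpus_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [sentence_score(f, weights) for f in corpus_features]
print(corpus_score(scores))
```

Under this assumption, replacing one sentence's translation changes the corpus score by exactly the change in that sentence's score divided by the number of sentences.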

[1] Klaus Obermayer, et al. Support vector learning for ordinal regression, 1999.

[2] Hermann Ney, et al. Syntax-Oriented Evaluation Measures for Machine Translation Output, 2009, WMT@EACL.

[3] Khalil Sima'an, et al. Fitting Sentence Level Translation Evaluation with Many Dense Features, 2014, EMNLP.

[4] Alon Lavie, et al. Meteor 1.3: Automatic Metric for Reliable Optimization and Evaluation of Machine Translation Systems, 2011, WMT@EMNLP.

[5] Qun Liu, et al. RED: A Reference Dependency Based MT Evaluation Metric, 2014, COLING.

[6] Ondrej Bojar, et al. Results of the WMT14 Metrics Shared Task, 2014.

[7] Yifan He, et al. Improving the Objective Function in Minimum Error Rate Training, 2009, MTSUMMIT.

[8] Khalil Sima'an, et al. Evaluating Word Order Recursively over Permutation-Forests, 2014, SSST@EMNLP.

[9] Alon Lavie, et al. Meteor Universal: Language Specific Translation Evaluation for Any Target Language, 2014, WMT@ACL.

[10] Milos Stanojevic. Removing Biases from Trainable MT Metrics by Using Self-Training, 2015, ArXiv.

[11] Kevin Duh, et al. Automatic Evaluation of Translation Quality for Distant Language Pairs, 2010, EMNLP.

[12] Khalil Sima'an, et al. BEER: BEtter Evaluation as Ranking, 2014, WMT@ACL.

[13] Daniel Gildea, et al. Factorization of Synchronous Context-Free Grammars in Linear Time, 2007, SSST@HLT-NAACL.

[14] Danqi Chen, et al. A Fast and Accurate Dependency Parser using Neural Networks, 2014, EMNLP.

[15] Alexandra Birch, et al. LRscore for Evaluating Lexical and Reordering Quality in MT, 2010, WMT@ACL.

[16] Salim Roukos, et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.