LAYERED: Metric for Machine Translation Evaluation

This paper describes the LAYERED metric, submitted to the WMT'14 metrics shared task. Various metrics exist for MT evaluation, such as BLEU (Papineni et al., 2002), METEOR (Lavie and Agarwal, 2007), and TER (Snover et al., 2006), but they prove inadequate in quite a few language settings, for example for free-word-order languages. In this paper, we propose an MT evaluation scheme based on the NLP layers: lexical, syntactic, and semantic. We contend that metrics at the higher layers are, after all, needed. Results are presented on the ACL-WMT 2013 and 2014 corpora. We arrive at a metric composed of weighted metrics at the individual layers, which correlates very well with human judgment.
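The final metric described above is a weighted combination of per-layer scores. As a minimal sketch of that idea, the function below combines hypothetical lexical, syntactic, and semantic scores with illustrative weights; the weight values and layer scorers are placeholders, not the ones tuned in the paper.

```python
# Sketch of a layered MT metric: a weighted linear combination of
# per-layer scores. Weights here are illustrative placeholders and
# not the values tuned in the paper.

def layered_score(lexical, syntactic, semantic, weights=(0.3, 0.3, 0.4)):
    """Combine per-layer scores (each in [0, 1]) into a single score."""
    w_lex, w_syn, w_sem = weights
    # A convex combination keeps the result in [0, 1].
    assert abs(w_lex + w_syn + w_sem - 1.0) < 1e-9, "weights must sum to 1"
    return w_lex * lexical + w_syn * syntactic + w_sem * semantic

# Example: a candidate that scores poorly on lexical overlap but well on
# semantics, as can happen for free-word-order languages.
score = layered_score(lexical=0.42, syntactic=0.61, semantic=0.80)
```

In practice the weights would be tuned so that the combined score maximizes correlation with human judgments, which is what the shared-task evaluation measures.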

[1] Andy Way, et al. Evaluating machine translation with LFG dependencies, 2007, Machine Translation.

[2] Daniel Jurafsky, et al. Robust Machine Translation Evaluation with Entailment Features, 2009, ACL.

[3] Alon Lavie, et al. METEOR: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments, 2007, WMT@ACL.

[4] Ondrej Bojar, et al. Results of the WMT14 Metrics Shared Task, 2014, WMT@ACL.

[5] Pushpak Bhattacharyya, et al. Some Issues in Automatic Evaluation of English-Hindi MT: More Blues for BLEU, 2006.

[6] Philipp Koehn, et al. Findings of the 2013 Workshop on Statistical Machine Translation, 2013, WMT@ACL.

[7] Adam Lopez, et al. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011.

[8] Alexandra Birch, et al. Reordering Metrics for MT, 2011, ACL.

[9] George R. Doddington. Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics, 2002.

[10] Salim Roukos, et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.

[11] Lluís Màrquez i Villodre, et al. Linguistic measures for automatic machine translation evaluation, 2010, Machine Translation.

[12] Christopher D. Manning, et al. Generating Typed Dependency Parses from Phrase Structure Parses, 2006, LREC.

[13] Matthew G. Snover, et al. A Study of Translation Edit Rate with Targeted Human Annotation, 2006, AMTA.

[14] Ying Zhang, et al. Interpreting BLEU/NIST Scores: How Much Improvement do We Need to Have a Better System?, 2004, LREC.

[15] Ding Liu, et al. Syntactic Features for Evaluation of Machine Translation, 2005, IEEvaluation@ACL.