TER-Plus: paraphrase, semantic, and alignment enhancements to Translation Edit Rate

This paper describes a new evaluation metric, TER-Plus (TERp), for the automatic evaluation of machine translation (MT). TERp is an extension of Translation Edit Rate (TER). It builds on the success of TER as an evaluation metric and alignment tool and addresses several of its weaknesses through the use of paraphrases, stemming, and synonyms, as well as edit costs that can be automatically optimized to correlate better with various types of human judgments. We present a correlation study comparing TERp to BLEU, METEOR, and TER, and illustrate that TERp can better evaluate translation adequacy.
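
To make the edit-cost idea concrete, the following is a minimal Python sketch of a TERp-style edit rate. It omits TER's shift (block-move) operation and phrase-level paraphrase matches, and the function name terp_like_score, the cost parameters, and the toy stem/synonym tables are illustrative assumptions, not the authors' implementation.

    # Illustrative sketch of a TERp-style edit rate with tunable costs.
    # Simplified: no shift (block-move) edits, no paraphrase table;
    # stem/synonym matches are modeled as cheap substitutions.
    def terp_like_score(hyp, ref, sub_cost=1.0, ins_cost=1.0, del_cost=1.0,
                        stem_cost=0.2, syn_cost=0.2,
                        stems=None, synonyms=None):
        """Word-level edit rate: total edit cost / reference length."""
        stems = stems or {}
        synonyms = synonyms or {}

        def match_cost(h, r):
            # Exact matches are free; stem or synonym matches are cheaper
            # than full substitutions (the costs TERp would tune).
            if h == r:
                return 0.0
            if stems.get(h, h) == stems.get(r, r):
                return stem_cost
            if r in synonyms.get(h, ()) or h in synonyms.get(r, ()):
                return syn_cost
            return sub_cost

        # Standard dynamic-programming edit distance over words.
        m, n = len(hyp), len(ref)
        d = [[0.0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            d[i][0] = i * del_cost
        for j in range(1, n + 1):
            d[0][j] = j * ins_cost
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                d[i][j] = min(d[i - 1][j] + del_cost,
                              d[i][j - 1] + ins_cost,
                              d[i - 1][j - 1] + match_cost(hyp[i - 1], ref[j - 1]))
        return d[m][n] / max(n, 1)

    hyp = "the cat sat on a mat".split()
    ref = "a cat sits on the mat".split()
    # With "sat"/"sits" treated as synonyms, the synonym match costs 0.2
    # instead of a full substitution, lowering the edit rate.
    print(terp_like_score(hyp, ref, synonyms={"sat": {"sits"}}))

Lower scores are better, as in TER; tuning costs such as syn_cost against human judgments is the mechanism by which the metric is optimized for adequacy or fluency.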
