Corroborating Text Evaluation Results with Heterogeneous Measures

Automatically produced texts (e.g., translations or summaries) are usually evaluated with n-gram-based measures such as BLEU or ROUGE, while the wide range of more sophisticated measures proposed in recent years remains largely ignored in practice. In this paper we first present an in-depth analysis of the state of the art in order to clarify this issue. We then formalize, and verify empirically, a set of properties satisfied by every text evaluation measure based on similarity to human-produced references. These properties imply that corroborating system improvements with additional measures always increases the overall reliability of the evaluation process. Moreover, the greater the heterogeneity of the measures (which is itself measurable), the higher their combined reliability. These results support the use of heterogeneous measures to consolidate text evaluation results.
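
The abstract's central claim, that agreement among heterogeneous reference-based measures makes an observed system improvement more trustworthy, can be illustrated with a minimal sketch. The toy measures, the correlation-based heterogeneity proxy, and all function names below are assumptions introduced for illustration only; they are not the paper's formalization.

```python
# Illustrative sketch (assumptions, not the paper's method): corroborate a
# system improvement with two reference-based similarity measures and estimate
# their heterogeneity as low correlation between their per-segment scores.
from collections import Counter
from statistics import mean
from math import sqrt


def ngram_overlap(candidate, reference, n=2):
    """Toy precision-oriented n-gram overlap (a BLEU-like surrogate)."""
    cand, ref = candidate.split(), reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    if not cand_ngrams:
        return 0.0
    matches = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    return matches / sum(cand_ngrams.values())


def unigram_recall(candidate, reference):
    """Toy recall-oriented unigram overlap (a ROUGE-1-like surrogate)."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    if not ref:
        return 0.0
    matches = sum(min(c, cand[w]) for w, c in ref.items())
    return matches / sum(ref.values())


def pearson(xs, ys):
    """Pearson correlation; near-zero values suggest heterogeneous measures."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def corroborated_improvement(sys_a, sys_b, refs, measures):
    """True only if every measure prefers system B over system A on average."""
    return all(
        mean(m(h, r) for h, r in zip(sys_b, refs))
        > mean(m(h, r) for h, r in zip(sys_a, refs))
        for m in measures
    )


if __name__ == "__main__":
    refs  = ["the cat sat on the mat",
             "he read the report today",
             "results were better than expected"]
    sys_a = ["the cat sat mat", "he read report", "results better expected"]
    sys_b = ["the cat sat on a mat", "he read the report",
             "results were better than planned"]
    measures = [ngram_overlap, unigram_recall]

    scores = [[m(h, r) for h, r in zip(sys_b, refs)] for m in measures]
    print("measure correlation:", pearson(scores[0], scores[1]))
    print("improvement corroborated:",
          corroborated_improvement(sys_a, sys_b, refs, measures))
```

The sketch deliberately pairs a precision-oriented and a recall-oriented measure: under the abstract's argument, an improvement confirmed by two measures that disagree more often (lower score correlation) is a stronger signal than one confirmed by two near-identical measures.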
