Ranking vs. Regression in Machine Translation Evaluation

Automatic evaluation of machine translation (MT) systems is an important research topic for advancing MT technology. Most automatic evaluation methods proposed to date are score-based: they compute scores intended to reflect translation quality, and MT systems are compared on the basis of those scores. We advocate an alternative, ranking-based perspective on automatic MT evaluation: instead of producing scores, we directly produce a ranking over the set of MT systems to be compared. This perspective is often simpler when the evaluation goal is system comparison. We argue that human ranking judgments are easier to elicit than absolute quality scores, and we develop a machine learning approach that trains directly on rank data. We compare this ranking method with a score-based regression method on WMT07 data. Results indicate that ranking achieves higher correlation with human judgments, especially when ranking-specific features are used.
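
As a rough illustration of the contrast between the two formulations (a minimal sketch, not the implementation described in the paper), the example below trains a Joachims-style pairwise ranking model and a support vector regression model on system-level feature vectors, then ranks the systems with each. All feature values, judgment data, and variable names here are invented for illustration.

    # Minimal sketch: pairwise ranking vs. score regression for comparing MT systems.
    # Feature vectors and human judgments are toy data, not from the paper.
    import numpy as np
    from sklearn.svm import LinearSVC, SVR

    rng = np.random.default_rng(0)

    # Hypothetical per-segment feature vectors for 4 MT systems
    # (e.g., n-gram precision, length ratio) and human ranks (1 = best).
    n_systems, n_segments, n_feats = 4, 50, 5
    X = rng.normal(size=(n_systems, n_segments, n_feats))
    human_rank = np.array([1, 3, 2, 4])

    # System-level feature vectors: average the segment-level features.
    sys_feats = X.mean(axis=1)                   # shape (n_systems, n_feats)

    # --- Regression: predict a quality score, then sort systems by it. ---
    adequacy = 5.0 - human_rank                  # hypothetical numeric score
    reg = SVR(kernel="linear").fit(sys_feats, adequacy)
    reg_order = np.argsort(-reg.predict(sys_feats))

    # --- Ranking: train on pairwise preferences (RankSVM-style). ---
    # For each pair (i, j) with i ranked above j, the difference vector
    # f(i) - f(j) is a positive example and f(j) - f(i) a negative one.
    pairs, labels = [], []
    for i in range(n_systems):
        for j in range(n_systems):
            if human_rank[i] < human_rank[j]:
                pairs.append(sys_feats[i] - sys_feats[j]); labels.append(1)
                pairs.append(sys_feats[j] - sys_feats[i]); labels.append(0)
    clf = LinearSVC(fit_intercept=False).fit(np.array(pairs), np.array(labels))
    rank_scores = sys_feats @ clf.coef_.ravel()  # higher = preferred
    rank_order = np.argsort(-rank_scores)

    print("human order (best first):", np.argsort(human_rank))
    print("regression order        :", reg_order)
    print("ranking order           :", rank_order)

The point of the sketch is only the difference in supervision: the regression model needs absolute quality scores as targets, while the ranking model needs only relative preferences between systems.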
