Reducing human assessment of machine translation quality to binary classifiers

This paper presents a method to predict human assessments of machine translation (MT) quality by combining binary classifiers through a coding matrix. The multiclass categorization problem is reduced to a set of binary problems, each solved by a standard classification learning algorithm trained on the scores of multiple automatic evaluation metrics. Experimental results on a large-scale human-annotated evaluation corpus show that the decomposition into binary classifiers achieves higher classification accuracy than solving the multiclass categorization problem directly. In addition, the proposed method correlates more strongly with human judgments at the sentence level than standard automatic evaluation measures.
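As a rough illustration of the coding-matrix reduction the abstract describes, the sketch below decomposes a four-grade quality prediction task into binary problems via an error-correcting output code. This is not the authors' implementation: the feature matrix (standing in for per-sentence automatic metric scores such as BLEU or WER), the A-D human grades, and the SVM base learner are all hypothetical choices, here expressed with scikit-learn's OutputCodeClassifier.

# Illustrative sketch only, assuming metric-score features and A-D grades;
# not the paper's actual code.
import numpy as np
from sklearn.multiclass import OutputCodeClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Each row: automatic-metric scores for one MT output sentence
# (columns might be BLEU, METEOR, WER, PER, ...). Values are synthetic.
X = rng.random((200, 5))
# Human quality grades, e.g., 0="A" (perfect) ... 3="D" (nonsense).
y = rng.integers(0, 4, size=200)

# The coding matrix assigns each of the 4 grades a bit string; one binary
# SVM is trained per bit, and a new sentence receives the grade whose
# code word is closest to the vector of binary predictions.
ecoc = OutputCodeClassifier(estimator=SVC(kernel="rbf"),
                            code_size=2.0, random_state=0)
ecoc.fit(X, y)
print(ecoc.predict(X[:5]))

With code_size=2.0 the matrix has twice as many columns (binary classifiers) as classes, which is what gives the code its error-correcting slack: a few misfiring binary learners can still leave the correct grade's code word nearest.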
