On the practice of error analysis for machine translation evaluation

Error analysis is a means of assessing machine translation output in qualitative terms, and it can serve as a basis for generating error profiles for different systems. As with other subjective approaches to evaluation, it runs the risk of low inter-annotator agreement, yet papers applying error analysis to MT very often do not discuss this aspect at all. In this paper, we report results from a comparative evaluation of two systems in which agreement was initially low, and discuss the different measures we took to improve it. We compared the effects of using more or less fine-grained taxonomies, and of restricting the analysis to short sentences only. We report results on inter-annotator agreement before and after these measures were taken, on the error categories that are most likely to be confused, and on the possibility of establishing error profiles even in the absence of high inter-annotator agreement.
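As a rough illustration of the quantities discussed above, the following sketch computes agreement between two annotators, lists the category pairs they confuse, and shows how collapsing a fine-grained taxonomy into a coarser one can change the agreement score. The abstract does not state which agreement measure was used, so Cohen's kappa is an assumption here; the category names, the `COARSE` mapping, and the toy annotations are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items given the same category by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own category distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same category throughout
    return (observed - expected) / (1 - expected)


def confused_pairs(labels_a, labels_b):
    """Count unordered category pairs on which the annotators disagree."""
    return Counter(
        tuple(sorted((a, b))) for a, b in zip(labels_a, labels_b) if a != b
    )


# Hypothetical fine-grained categories mapped to a hypothetical coarse taxonomy.
COARSE = {
    "word_form": "word",
    "wrong_sense": "word",
    "missing_word": "missing",
    "word_order": "order",
}

# Toy annotations of the same six errors by two annotators (invented data).
annotator_1 = ["word_form", "wrong_sense", "missing_word",
               "word_order", "word_form", "missing_word"]
annotator_2 = ["wrong_sense", "wrong_sense", "missing_word",
               "word_order", "word_form", "word_order"]

fine_kappa = cohen_kappa(annotator_1, annotator_2)
coarse_kappa = cohen_kappa([COARSE[c] for c in annotator_1],
                           [COARSE[c] for c in annotator_2])

print(f"fine-grained kappa: {fine_kappa:.3f}")
print(f"coarse kappa:       {coarse_kappa:.3f}")
print("confused pairs:", confused_pairs(annotator_1, annotator_2).most_common())
```

In this toy data the disagreement between two word-level categories disappears once they are merged, which is the kind of effect a comparison of more and less fine-grained taxonomies would look for. Real annotations would also need a step that aligns the error spans marked by each annotator before categories can be compared, which this sketch leaves out.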
