Study and correlation analysis of linguistic, perceptual, and automatic machine translation evaluations

Evaluation of machine translation output is an important task. Various human evaluation techniques as well as automatic metrics have been proposed and investigated in the last decade. However, very few evaluation methods take the linguistic aspect into account. In this article, we use an objective evaluation method for machine translation output that classifies each translation error into one of the following five linguistic levels: orthographic, morphological, lexical, semantic, and syntactic. Linguistic guidelines for the target language are required, and human evaluators use them to classify the output errors. The experiments are performed on English-to-Catalan and Spanish-to-Catalan translation outputs generated by four different systems: two rule-based and two statistical. All translations are evaluated using the following three methods: a standard human perceptual evaluation method, several widely used automatic metrics, and the human linguistic evaluation. Pearson and Spearman correlation coefficients between the linguistic, perceptual, and automatic results are then calculated, showing that the semantic level correlates significantly with both perceptual evaluation and automatic metrics.
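
As an illustration of the correlation step, the following minimal sketch computes Pearson and Spearman coefficients between two lists of scores using only the Python standard library. The function names and the sample numbers are hypothetical and are not taken from the article's experiments; Spearman is computed here as the Pearson correlation of average ranks, which is one common way to handle ties.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical values for four systems (NOT results from the article):
    # semantic-level error counts and mean perceptual scores.
    semantic_errors = [42, 35, 58, 27]
    perceptual_scores = [3.1, 3.4, 2.6, 3.9]
    print("Pearson: ", pearson(semantic_errors, perceptual_scores))
    print("Spearman:", spearman(semantic_errors, perceptual_scores))
```

With error counts on one side and quality scores on the other, a strong relationship would show up as a coefficient close to -1, since more errors should correspond to lower perceived quality.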
