On the Robustness of Syntactic and Semantic Features for Automatic MT Evaluation

Linguistic metrics based on syntactic and semantic information have proven very effective for automatic MT evaluation. However, no results have been reported so far on their performance when applied to heavily ill-formed, low-quality translations. To shed some light on this issue, we present an empirical study of the behavior of a heterogeneous set of metrics based on linguistic analysis in the paradigmatic case of speech translation between unrelated languages. Corroborating previous findings, we verify that metrics based on deep linguistic analysis exhibit very robust and stable behavior at the system level. At the sentence level, however, these metrics suffer a significant drop in performance, in many cases attributable to a loss of recall caused by parsing errors or by the absence of any parse at all; this can be partially remedied by backing off to lexical similarity, as illustrated in the sketch below.
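
The following is a minimal sketch, not the authors' implementation, of the back-off idea described above: when a parser-based (syntactic or semantic) metric cannot produce a score for a sentence, fall back to a simple lexical overlap measure. The function names (`lexical_overlap`, `robust_score`) and the unigram-F1 back-off are illustrative assumptions.

```python
def lexical_overlap(candidate, reference):
    """Unigram F1 between candidate and reference token lists
    (an assumed, simple stand-in for a lexical similarity metric)."""
    cand, ref = candidate.split(), reference.split()
    # Clipped count of candidate tokens that also appear in the reference.
    ref_counts = {}
    for t in ref:
        ref_counts[t] = ref_counts.get(t, 0) + 1
    common = 0
    for t in cand:
        if ref_counts.get(t, 0) > 0:
            ref_counts[t] -= 1
            common += 1
    if not cand or not ref or common == 0:
        return 0.0
    precision = common / len(cand)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)


def robust_score(candidate, reference, linguistic_score):
    """Back off to lexical similarity when the linguistic metric
    yields no score (e.g. the parser failed on an ill-formed candidate)."""
    score = linguistic_score(candidate, reference)  # assumed to return None on parse failure
    if score is None:
        return lexical_overlap(candidate, reference)
    return score
```

In this sketch the linguistic metric is passed in as a callable, so any syntax- or semantics-based score can be combined with the lexical back-off without changing the scoring loop.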
