Structured vs. Flat Semantic Role Representations for Machine Translation Evaluation

We argue that failing to capture the degree of contribution of each semantic frame in a sentence explains puzzling results in recent work on the MEANT family of semantic MT evaluation metrics. Those results have disturbingly indicated that dissociating semantic roles and fillers from their predicates actually improves correlation with human adequacy judgments, even though, intuitively, properly segregating event frames should more accurately reflect the preservation of meaning. Our analysis finds that both properly structured and flattened representations fail to adequately account for the contribution of each semantic frame to the overall sentence. We then show that the correlation of HMEANT, the human variant of MEANT, can be greatly improved by introducing a simple length-based weighting scheme that approximates the degree of contribution of each semantic frame to the overall sentence. The new results also show that, without flattening the structure of semantic frames, weighting each frame's degree of contribution gives HMEANT higher correlations than both the previously best-performing flattened model and HTER.
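The abstract does not spell out the exact weighting formula, but the idea lends itself to a short sketch. The following is a minimal illustration, assuming each frame's weight is the fraction of sentence tokens its span covers; the function names (frame_weight, weighted_sentence_score), the input structure, and the normalization are hypothetical conveniences, not taken from the paper.

```python
# Minimal sketch of length-based frame weighting for a sentence-level
# semantic MT metric. ASSUMPTION: a frame's contribution is approximated
# by the fraction of sentence tokens it covers; the paper's actual
# formulation may differ.

def frame_weight(frame_tokens, sentence_len):
    """Weight a frame by the fraction of the sentence its tokens cover."""
    return len(frame_tokens) / sentence_len

def weighted_sentence_score(frames, sentence_len):
    """Combine per-frame match scores into one sentence-level score.

    `frames` is a list of (frame_tokens, match_score) pairs, where
    match_score is the role-matching score already computed for that
    frame (e.g. an f-score over matched role fillers). Each frame's
    score is weighted by its approximate contribution to the sentence.
    """
    total_weight = sum(frame_weight(toks, sentence_len) for toks, _ in frames)
    if total_weight == 0:
        return 0.0  # no frames: no semantic content to score
    return sum(frame_weight(toks, sentence_len) * score
               for toks, score in frames) / total_weight

# Toy usage: a longer frame dominates the sentence-level score.
frames = [(["the", "president", "gave", "a", "speech"], 0.8),
          (["a", "speech", "yesterday"], 0.5)]
print(weighted_sentence_score(frames, sentence_len=7))
```

Under this reading, a frame spanning most of the sentence moves the overall score far more than a short peripheral frame, which is the intuition the weighting scheme is meant to capture.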
