HUME: Human UCCA-Based Evaluation of Machine Translation

Human evaluation of machine translation normally uses sentence-level measures such as relative ranking or adequacy scales. However, these provide no insight into possible errors, and do not scale well with sentence length. We argue for a semantics-based evaluation, which captures what meaning components are retained in the MT output, thus providing a more fine-grained analysis of translation quality, and enabling the construction and tuning of semantics-based MT. We present a novel human semantic evaluation measure, Human UCCA-based MT Evaluation (HUME), building on the UCCA semantic representation scheme. HUME covers a wider range of semantic phenomena than previous methods and does not rely on semantic annotation of the potentially garbled MT output. We experiment with four language pairs, demonstrating HUME's broad applicability, and report good inter-annotator agreement rates and correlation with human adequacy scores.
