Automated Paraphrase Lattice Creation for HyTER Machine Translation Evaluation

We propose a variant of a well-known machine translation (MT) evaluation metric, HyTER (Dreyer and Marcu, 2012), which exploits reference translations enriched with meaning equivalent expressions. The original HyTER metric relied on hand-crafted paraphrase networks which restricted its applicability to new data. We test, for the first time, HyTER with automatically built paraphrase lattices. We show that although the metric obtains good results on small and carefully curated data with both manually and automatically selected substitutes, it achieves medium performance on much larger and noisier datasets, demonstrating the limits of the metric for tuning and evaluation of current MT systems.

[1]  Shankar Kumar,et al.  Local Phrase Reordering Models for Statistical Machine Translation , 2005, HLT.

[2]  Chris Callison-Burch,et al.  PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification , 2015, ACL.

[3]  Khalil Sima'an,et al.  BEER 1.1: ILLC UvA submission to metrics and tuning task , 2015, WMT@EMNLP.

[4]  Daniel Marcu,et al.  HyTER: Meaning-Equivalent Semantics for Translation Evaluation , 2012, NAACL.

[5]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[6]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[7]  Arianna Bisazza,et al.  Neural versus Phrase-Based Machine Translation Quality: a Case Study , 2016, EMNLP.

[8]  Benjamin Van Durme,et al.  Annotated Gigaword , 2012, AKBC-WEKEX@NAACL-HLT.

[9]  Lucia Specia,et al.  CobaltF: A Fluent Metric for MT Evaluation , 2016, WMT.

[10]  Matthew G. Snover,et al.  A Study of Translation Edit Rate with Targeted Human Annotation , 2006, AMTA.

[11]  Omer Levy,et al.  A Simple Word Embedding Model for Lexical Substitution , 2015, VS@HLT-NAACL.

[12]  Johan Schalkwyk,et al.  OpenFst: A General and Efficient Weighted Finite-State Transducer Library , 2007, CIAA.

[13]  Philipp Koehn,et al.  Findings of the 2015 Workshop on Statistical Machine Translation , 2015, WMT@EMNLP.

[14]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[15]  Cyril Goutte Automatic Evaluation of Machine Translation Quality , 2006 .

[16]  Petr Sojka,et al.  Software Framework for Topic Modelling with Large Corpora , 2010 .

[17]  Nitin Madnani,et al.  Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric , 2009, WMT@EACL.

[18]  Ondrej Bojar,et al.  Results of the WMT17 Metrics Shared Task , 2017, WMT.

[19]  Qun Liu,et al.  CASICT-DCU Participation in WMT2015 Metrics Task , 2015, WMT@EMNLP.

[20]  Ralph Weischedel,et al.  A STUDY OF TRANSLATION ERROR RATE WITH TARGETED HUMAN ANNOTATION , 2005 .

[21]  Alon Lavie,et al.  Extending the METEOR Machine Translation Evaluation Metric to the Phrase Level , 2010, NAACL.

[22]  Chris Callison-Burch,et al.  PPDB: The Paraphrase Database , 2013, NAACL.

[23]  Dekai Wu,et al.  Fully Automatic Semantic MT Evaluation , 2012, WMT@NAACL-HLT.

[24]  Timothy Baldwin,et al.  Can machine translation systems be evaluated by the crowd alone , 2015, Natural Language Engineering.