CHARCUT: Human-Targeted Character-Based MT Evaluation with Loose Differences

We present CHARCUT, a character-based machine translation evaluation metric derived from a human-targeted segment difference visualisation algorithm. It combines an iterative search for longest common substrings between the candidate and the reference translation with a simple length-based threshold, enabling loose differences that limit noisy character matches. Its main advantage is that it produces scores that directly reflect human-readable string differences, making it a useful support tool both for manual analysis of MT output and for its display to end users. Experiments on WMT16 metrics task data show that it is on par with the best “untrained” metrics in terms of correlation with human judgement, well above the BLEU and TER baselines, on both the system-level and segment-level tasks.
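
As a rough illustration of the loose-difference idea, the sketch below implements an iterative longest-common-substring search with a minimum match length in Python. It is an assumption-laden approximation, not the CHARCUT implementation: the function and parameter names (charcut_sketch, min_match) are invented for illustration, and the final normalisation (unmatched characters over total characters) is a simplification of the actual score.

```python
# A minimal sketch of the loose-difference idea from the abstract, not the
# authors' reference implementation: repeatedly take the longest common
# substring between candidate and reference, keep it only if it reaches a
# length-based threshold, mask the matched spans, and repeat.

def longest_common_substring(a: str, b: str):
    """Return (length, start_a, start_b) of the longest common substring."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)  # DP row of suffix-match lengths
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[0]:
                    best = (cur[j], i - cur[j], j - cur[j])
        prev = cur
    return best

def charcut_sketch(candidate: str, reference: str, min_match: int = 3) -> float:
    """Distance in [0, 1]: 0 = all characters matched, 1 = nothing matched."""
    cand, ref = candidate, reference
    matched = 0
    while True:
        length, i, j = longest_common_substring(cand, ref)
        # Length-based threshold: discard short, noisy character matches.
        if length < max(1, min_match):
            break
        matched += length
        # Mask matched spans with distinct sentinels so masked characters
        # can never re-match between candidate and reference.
        cand = cand[:i] + "\x00" * length + cand[i + length:]
        ref = ref[:j] + "\x01" * length + ref[j + length:]
    total = len(candidate) + len(reference)
    return (total - 2 * matched) / total if total else 0.0

print(charcut_sketch("the cat sat on the mat", "the cat sits on a mat"))
```

Using two different sentinel characters is one simple way to guarantee termination: each accepted match permanently removes characters from the search space, so the loop stops once no common substring reaches the threshold.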
