Automatic Annotation of Machine Translation Datasets with Binary Quality Judgements

The automatic estimation of machine translation (MT) output quality is an active research area due to its many potential applications (e.g. aiding human translation and post-editing, re-ranking MT hypotheses, MT system combination). Current approaches to the task rely on supervised learning methods, for which high-quality labelled data is essential. In this framework, quality estimation (QE) has mainly been addressed as a regression problem, where models trained on (source, target) sentence pairs annotated with continuous scores (in the [0, 1] interval) are used to assign quality scores (in the same interval) to unseen data. This formulation of the problem assumes that continuous scores are informative and easily interpretable by different users. These assumptions, however, conflict with the subjectivity inherent in human translation and evaluation. On the one hand, the subjectivity of human judgements adds noise and bias to annotations based on scaled values, reducing the usability of the resulting datasets, especially in application scenarios where a sharp distinction between “good” and “bad” translations is needed. On the other hand, continuous scores are not always sufficient to decide whether a translation is actually acceptable or not. To overcome these issues, we present an automatic method for annotating (source, target) pairs with binary judgements that reflect an empirical and easily interpretable notion of quality. The method is applied to annotate three QE datasets for different language combinations with binary judgements. The three datasets are combined into a single resource, called BinQE, which can be freely downloaded from http://hlt.fbk.eu/technologies/binqe.
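To make the contrast between regression-style QE labels and binary judgements concrete, the sketch below derives a “good”/“bad” label from a continuous post-editing-effort score by thresholding. This is a minimal illustrative sketch only: the effort measure, the 0.3 cut-off, and the binarize helper are assumptions made for the example, not the annotation method defined in the paper.

```python
# Illustrative sketch: turn continuous quality scores into binary judgements.
# Assumption: each (source, target) pair carries a post-editing-effort score
# in [0, 1], where lower means less editing needed; the 0.3 threshold is an
# arbitrary example value, not the authors' setting.

def binarize(pairs, threshold=0.3):
    """Map (source, target, effort) triples to (source, target, label) triples.

    label is 1 ("good": acceptable with little or no post-editing) when the
    effort score is at or below the threshold, otherwise 0 ("bad").
    """
    annotated = []
    for source, target, effort in pairs:
        label = 1 if effort <= threshold else 0
        annotated.append((source, target, label))
    return annotated


if __name__ == "__main__":
    # Toy data for demonstration only.
    sample = [
        ("Das Haus ist rot.", "The house is red.", 0.05),
        ("Er hat keine Zeit.", "He has not the time them.", 0.62),
    ]
    for src, tgt, label in binarize(sample):
        print(label, "|", src, "=>", tgt)
```

A classifier trained on such binary labels can then answer the practical question a continuous score leaves open, namely whether a given translation is acceptable as is.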
