Automatic Annotation of Machine Translation Datasets with Binary Quality Judgements

The automatic estimation of machine translation (MT) output quality is an active research area due to its many potential applications (e.g. aiding human translation and post-editing, re-ranking MT hypotheses, MT system combination). Current approaches to the task rely on supervised learning methods, for which high-quality labelled data is essential. In this framework, quality estimation (QE) has mainly been addressed as a regression problem, where models trained on (source, target) sentence pairs annotated with continuous scores (in the [0, 1] interval) are used to assign quality scores (in the same interval) to unseen data. This formulation of the problem assumes that continuous scores are informative and easily interpretable by different users. These assumptions, however, conflict with the subjectivity inherent in human translation and evaluation. On the one hand, the subjectivity of human judgements adds noise and bias to annotations based on scaled values, reducing the usability of the resulting datasets, especially in application scenarios where a sharp distinction between “good” and “bad” translations is needed. On the other hand, continuous scores are not always sufficient to decide whether a translation is actually acceptable or not. To overcome these issues, we present an automatic method for annotating (source, target) pairs with binary judgements that reflect an empirical and easily interpretable notion of quality. The method is applied to annotate three QE datasets for different language combinations with binary judgements. The three datasets are combined into a single resource, called BinQE, which can be freely downloaded from http://hlt.fbk.eu/technologies/binqe.
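To make the contrast between regression-style QE labels and binary judgements concrete, the sketch below derives a “good”/“bad” label from a continuous post-editing-effort score by thresholding. This is a minimal illustrative sketch only: the effort measure, the 0.3 cut-off, and the binarize helper are assumptions made for the example, not the annotation method defined in the paper.

```python
# Illustrative sketch: turn continuous quality scores into binary judgements.
# Assumption: each (source, target) pair carries a post-editing-effort score
# in [0, 1], where lower means less editing needed; the 0.3 threshold is an
# arbitrary example value, not the authors' setting.

def binarize(pairs, threshold=0.3):
    """Map (source, target, effort) triples to (source, target, label) triples.

    label is 1 ("good": acceptable with little or no post-editing) when the
    effort score is at or below the threshold, otherwise 0 ("bad").
    """
    annotated = []
    for source, target, effort in pairs:
        label = 1 if effort <= threshold else 0
        annotated.append((source, target, label))
    return annotated


if __name__ == "__main__":
    # Toy data for demonstration only.
    sample = [
        ("Das Haus ist rot.", "The house is red.", 0.05),
        ("Er hat keine Zeit.", "He has not the time them.", 0.62),
    ]
    for src, tgt, label in binarize(sample):
        print(label, "|", src, "=>", tgt)
```

A classifier trained on such binary labels can then answer the practical question a continuous score leaves open, namely whether a given translation is acceptable as is.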
