Assessing the Accuracy of Discourse Connective Translations: Validation of an Automatic Metric

Automatic metrics for the evaluation of machine translation (MT) compute scores that globally characterize certain aspects of MT quality, such as adequacy and fluency. This paper introduces a reference-based metric focused on a particular class of function words, namely discourse connectives, which are important for text structuring and rather challenging for MT. To measure the accuracy of connective translation (ACT), the metric relies on automatic word-level alignment between a source sentence and, respectively, its reference and candidate translations, along with heuristics for comparing the translations of discourse connectives. Using a dictionary of equivalents, the translations are scored automatically or, for better precision, semi-automatically. The precision of the ACT metric is assessed by human judges on sample data for English/French and English/Arabic translations: the ACT scores are on average within 2% of human scores. The ACT metric is then applied to several commercial and research MT systems, providing an assessment of their performance on discourse connectives.
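The following is a minimal sketch of the comparison logic the abstract describes, not the authors' implementation. It assumes the word-alignment step has already been run, so that each source connective comes paired with its translation in the reference and in the candidate (or None when the alignment finds no counterpart), and it uses a hypothetical `EQUIVALENTS` dictionary standing in for the paper's dictionary of equivalents; the French entries are illustrative only.

```python
# Sketch of ACT-style scoring, assuming alignment has already mapped each
# source connective to its reference and candidate translations.

# Hypothetical dictionary of equivalents: target connectives that count as
# acceptable substitutes for one another (illustrative EN->FR entries).
EQUIVALENTS = {
    "puisque": {"car", "parce que"},
    "depuis que": set(),
}

def classify_connective(ref_conn, cand_conn):
    """Classify one candidate translation of a source connective
    against its reference translation."""
    if cand_conn is None:
        return "missing"        # candidate drops the connective
    if ref_conn is None:
        return "extra"          # candidate translates a connective the reference drops
    if cand_conn == ref_conn:
        return "identical"
    if cand_conn in EQUIVALENTS.get(ref_conn, set()):
        return "equivalent"     # acceptable per the dictionary of equivalents
    return "incompatible"

def act_score(pairs):
    """ACT-style accuracy: the share of source connectives translated
    identically to the reference or by a dictionary-listed equivalent.
    `pairs` is a list of (reference_translation, candidate_translation)
    tuples, one per source connective."""
    labels = [classify_connective(r, c) for r, c in pairs]
    correct = sum(label in ("identical", "equivalent") for label in labels)
    return correct / len(pairs) if pairs else 0.0

# Example: three source connectives; one identical, one equivalent, one wrong.
print(act_score([
    ("puisque", "puisque"),      # identical
    ("puisque", "car"),          # equivalent per the dictionary
    ("puisque", "depuis que"),   # incompatible
]))  # -> 0.666...
```

In the semi-automatic mode the abstract mentions, the "incompatible" and "missing" cases would presumably be handed to a human judge rather than counted as errors outright.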
