A Human Judgement Corpus and a Metric for Arabic MT Evaluation

We present a human judgments dataset and an adapted metric for the evaluation of Arabic machine translation. Our medium-scale dataset is the first of its kind for Arabic with high annotation quality. We use the dataset to adapt the BLEU score for Arabic. Our score (AL-BLEU) provides partial credit for stem and morphological matches between hypothesis and reference words. We evaluate BLEU, METEOR and AL-BLEU on our human judgments corpus and show that AL-BLEU has the highest correlation with human judgments. We are releasing the dataset and software to the research community.
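
As a rough illustration of the partial-credit idea described above, the Python sketch below scores a hypothesis/reference token pair by exact match, stem match, and morphological-feature overlap. The Token fields, the weights w_exact/w_stem/w_feat, and the toy feature sets are illustrative assumptions, not the paper's actual parameters; AL-BLEU folds such credits into BLEU's n-gram counts, and an Arabic morphological analyzer (such as MADA+TOKAN) would supply the stems and features that are hard-coded here.

    # Minimal sketch of partial-credit token matching in the spirit of AL-BLEU.
    # Weights and the stem/feature values are assumptions for illustration only;
    # in practice a morphological analyzer produces the stem and feature set.

    from dataclasses import dataclass

    @dataclass
    class Token:
        surface: str           # fully inflected word form
        stem: str              # stem from a morphological analyzer
        features: frozenset    # morphological features, e.g. {"pos=verb", "person=1"}

    def token_credit(hyp: Token, ref: Token,
                     w_exact: float = 1.0,
                     w_stem: float = 0.6,
                     w_feat: float = 0.4) -> float:
        """Return a partial-match credit in [0, 1] for a hypothesis/reference pair."""
        if hyp.surface == ref.surface:
            return w_exact                      # exact surface match: full credit
        credit = 0.0
        if hyp.stem == ref.stem:
            credit += w_stem                    # same stem: partial credit
        if hyp.features and ref.features:
            overlap = len(hyp.features & ref.features) / len(hyp.features | ref.features)
            credit += w_feat * overlap          # shared morphological features
        return min(credit, w_exact)

    # Example: same stem, partially overlapping features -> partial credit
    hyp = Token("ktbt", "ktb", frozenset({"pos=verb", "person=1"}))
    ref = Token("ktb",  "ktb", frozenset({"pos=verb", "person=3"}))
    print(token_credit(hyp, ref))               # 0.6 + 0.4 * (1/3) ≈ 0.73

In a BLEU-style computation, these fractional credits would replace the binary clipped counts in the modified n-gram precision, so that a hypothesis word sharing a stem with the reference is rewarded rather than scored as a complete miss.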
