A Parallel Corpus for Evaluating Machine Translation between Arabic and European Languages

We present Arab-Acquis, a large, publicly available dataset for evaluating machine translation between 22 European languages and Arabic. Arab-Acquis consists of over 12,000 sentences from the JRC-Acquis (Acquis Communautaire) corpus, each translated into Arabic twice by professional translators, once from English and once from French, for a total of over 600,000 words. The corpus follows the tuning, development, and test splits used in previous work. We describe the corpus and how it was created, and we present the first benchmarking results on translating to and from Arabic for 22 European languages.
