Benchmarking SMT Performance for Farsi Using the TEP++ Corpus

Statistical machine translation (SMT) suffers from several problems that are exacerbated when training data is in short supply. In this paper we address the data sparsity problem for Farsi (Persian) and introduce a new parallel corpus, TEP++. Compared with previous results, the new dataset is more effective for training Farsi SMT engines and yields better output. In our experiments, using TEP++ as bilingual training data and BLEU as the metric, we achieved improvements of +11.17 points (60%) and +7.76 points (63.92%) in the Farsi–English and English–Farsi directions, respectively. Furthermore, we describe an engine (SF2FF) that translates between formal and informal Farsi, which differ enough in syntax and terminology to be regarded as different languages. The SF2FF engine also works as an intelligent normalizer for Farsi texts. To demonstrate its use, we applied SF2FF to clean the IWSLT 2013 dataset; the resulting normalized data, when used as training data, improved translation quality over FBK’s Farsi engine.
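
The reported gains combine an absolute BLEU difference with a percentage. A minimal sketch of how the two figures relate, assuming the percentage is the gain relative to the baseline BLEU score (the abstract does not state this explicitly); the function names are illustrative and the baselines it derives are implied values, not figures quoted from the paper:

```python
def implied_baseline(abs_gain: float, rel_gain_percent: float) -> float:
    """Baseline BLEU implied by an absolute gain and a relative gain.

    Assumes rel_gain_percent = 100 * abs_gain / baseline.
    """
    if rel_gain_percent <= 0:
        raise ValueError("relative gain must be positive")
    return abs_gain / (rel_gain_percent / 100.0)


def implied_new_score(abs_gain: float, rel_gain_percent: float) -> float:
    """BLEU after the improvement, under the same assumption."""
    return implied_baseline(abs_gain, rel_gain_percent) + abs_gain


if __name__ == "__main__":
    # Figures taken from the abstract: (direction, absolute gain, relative gain in %).
    reported = [
        ("Farsi-English", 11.17, 60.0),
        ("English-Farsi", 7.76, 63.92),
    ]
    for direction, gain, percent in reported:
        base = implied_baseline(gain, percent)
        new = implied_new_score(gain, percent)
        print(f"{direction}: implied baseline {base:.2f}, implied new score {new:.2f}")
```

Because the 60% figure appears to be rounded, the Farsi–English baseline recovered this way is only approximate.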
