MIZAN: A Large Persian-English Parallel Corpus

Machine translation, one of the most important tasks in natural language processing, now depends heavily on multilingual parallel corpora. In this paper, we introduce the largest Persian-English parallel corpus, with more than one million sentence pairs collected from masterpieces of literature. We also present the acquisition process and statistics of the corpus, and experiment with a baseline statistical machine translation system built using the corpus.
