Paper Information - TEP: Tehran English-Persian Parallel Corpus

TEP: Tehran English-Persian Parallel Corpus

Parallel corpora are one of the key resources in natural language processing. Despite their importance in many multilingual applications, no large-scale English-Persian corpus has been made available so far, owing to the difficulty of creating one and the intensive labor required. This paper describes the construction of the Tehran English-Persian parallel corpus (TEP) from movie subtitles, together with some of the difficulties we experienced during data extraction and sentence alignment. To the best of our knowledge, TEP is the first freely released large-scale (on the order of millions of words) English-Persian parallel corpus.

Heshaam Faili | Mohammad Taher Pilehvar | Abdol Hamid Pilehvar

[1] Mahmood Bijankhan, et al. A Study of Corpus Development for Persian, 2010, Int. J. Asian Lang. Process.

[2] Kenneth W. Church, et al. A Program for Aligning Sentences in Bilingual Corpora, 1993.

[3] Mehdi Mohammadi, et al. Building Bilingual Parallel Corpora Based on Wikipedia, 2010, 2010 Second International Conference on Computer Engineering and Applications.

[4] Emmanuel Giguet, et al. Multilingual Aligned Corpora From Movie Subtitles, 2005.

[5] Kenneth Ward Church, et al. A Program for Aligning Sentences in Bilingual Corpora, 1993, CL.

[6] Shrikanth Narayanan, et al. An English-Persian Automatic Speech Translator: Recent Developments in Domain Portability and User Modeling, 2006.

[7] Rémi Zajac, et al. Persian-English Machine Translation: An Overview of the Shiraz Project, 2000.

[8] Karine Megerdoomian, et al. Persian Computational Morphology: A Unification-Based Approach, 2000.

[9] Alon Itai, et al. Using Movie Subtitles for Creating a Large-Scale Bilingual Corpora, 2008, LREC.

[10] Jörg Tiedemann. Improved Sentence Alignment for Movie Subtitles, 2007.

[11] Heshaam Faili, et al. PersianSMT: A first attempt to English-Persian statistical machine translation, 2010.

[12] Tiejun Zhao, et al. Train the Machine with What It Can Learn - Corpus Selection for SMT, 2011, BUCC@ACL/IJCNLP.

[13] Tayebeh Mosavi Miangah. Constructing a Large-Scale English-Persian Parallel Corpus, 2009.