Extracting Parallel Paragraphs and Sentences from English-Persian Translated Documents

Sentence and paragraph alignment is essential for preparing the parallel texts needed in applications such as machine translation. For under-resourced languages like Persian, the lack of sufficient linguistic data makes this task challenging. In this paper, we propose a hybrid sentence and paragraph alignment model for Persian-English parallel documents, based on simple linguistic features together with length similarity between the sentences and paragraphs of the source and target languages. As alignment metrics, we use a small bilingual dictionary of Persian-English nouns, punctuation marks, and length similarity. We combine these features in a linear model and use a genetic algorithm to learn the weights of the linear equation. Evaluation results show that the extracted features improve on the baseline model, which relies on length alone.
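The combination described above can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration and is not taken from the paper: the feature definitions (`length_sim`, `punct_sim`, `dict_sim`), the three-entry toy dictionary `NOUN_DICT`, the toy sentence pairs, and the genetic algorithm settings (truncation selection, uniform crossover, Gaussian mutation). The sketch scores a candidate pair as a weighted sum of the three features and evolves the weights so that each source sentence ranks its gold target first. The length-only weight vector is placed in the initial population, so the returned weights are never worse than that baseline on the training pairs.

```python
import random

# Hypothetical toy dictionary of English -> Persian nouns (illustrative only).
NOUN_DICT = {"book": "کتاب", "city": "شهر", "water": "آب"}
PUNCT = set(".,;:!?،؛؟")  # Latin and Persian punctuation marks

def length_sim(src, tgt):
    """Ratio of the shorter to the longer character length, in [0, 1]."""
    a, b = len(src), len(tgt)
    return min(a, b) / max(a, b) if max(a, b) else 1.0

def punct_sim(src, tgt):
    """Ratio of punctuation counts, in [0, 1]; 1.0 when neither side has any."""
    a = sum(ch in PUNCT for ch in src)
    b = sum(ch in PUNCT for ch in tgt)
    return min(a, b) / max(a, b) if max(a, b) else 1.0

def dict_sim(src, tgt):
    """Fraction of known source nouns whose translation occurs in the target."""
    words = [w.strip(".,;:!?").lower() for w in src.split()]
    known = [w for w in words if w in NOUN_DICT]
    if not known:
        return 0.0
    return sum(NOUN_DICT[w] in tgt for w in known) / len(known)

def features(src, tgt):
    return [length_sim(src, tgt), punct_sim(src, tgt), dict_sim(src, tgt)]

def score(weights, src, tgt):
    """Linear model: weighted sum of the three alignment features."""
    return sum(w * f for w, f in zip(weights, features(src, tgt)))

def fitness(weights, pairs):
    """Number of source sentences whose best-scoring target is the gold one."""
    targets = [t for _, t in pairs]
    correct = 0
    for src, gold in pairs:
        best = max(targets, key=lambda t: score(weights, src, t))
        correct += best == gold
    return correct

def learn_weights(pairs, pop_size=20, generations=30, seed=0):
    """Toy genetic algorithm over weight vectors in [0, 1]^3."""
    rng = random.Random(seed)
    pop = [[1.0, 0.0, 0.0]]  # length-only baseline is always a candidate
    pop += [[rng.random() for _ in range(3)] for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=lambda w: fitness(w, pairs), reverse=True)
        parents = pop[: pop_size // 2]  # truncation selection; parents survive
        children = []
        while len(parents) + len(children) < pop_size:
            p, q = rng.sample(parents, 2)
            child = [rng.choice(genes) for genes in zip(p, q)]  # uniform crossover
            i = rng.randrange(3)  # mutate one weight, clipped to [0, 1]
            child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0, 0.1)))
            children.append(child)
        pop = parents + children
    return max(pop, key=lambda w: fitness(w, pairs))

# Toy gold-aligned pairs (invented for this sketch).
TOY_PAIRS = [
    ("I read a book.", "من یک کتاب خواندم."),
    ("Where is the city?", "شهر کجاست؟"),
    ("Water is cold.", "آب سرد است."),
]

if __name__ == "__main__":
    weights = learn_weights(TOY_PAIRS)
    print(weights, fitness(weights, TOY_PAIRS))
```

Counting correct top-ranked targets is only one possible fitness function; a real aligner would typically search over monotone alignments (including one-to-many links) rather than rank each sentence independently.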
