Sentence alignment using hybrid model

Parallel corpora have become an essential resource for work in multilingual natural language processing. Sentence-aligned parallel corpora, however, are more effective than non-aligned parallel corpora for cross-language information retrieval and machine translation applications. In this paper, we present a new approach to aligning sentences in bilingual parallel corpora based on the character length of the text between successive punctuation marks. A probabilistic score is assigned to each proposed correspondence of texts, based on the scaled difference of the lengths of the two texts (in characters) and the variance of this difference. Using this score reduced the time required for punctuation matching and increased sentence alignment precision. With this new approach, we achieved a 21.8% improvement over the length-based approach when applied to English-Arabic parallel documents.
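The length-based score described above can be illustrated with a minimal sketch. It is not the implementation evaluated in this paper: it assumes the normal-distribution formulation of the classic length-based aligner of Gale and Church, with the expected length ratio `c = 1.0` and variance `s2 = 6.8` taken from that work rather than estimated for English-Arabic text. The function names `segment_lengths` and `length_score` and the set of punctuation marks are hypothetical choices made for illustration.

```python
import math
import re


def segment_lengths(text, marks=".,;:!?،؛؟"):
    """Character lengths of the text pieces between successive punctuation marks.

    The default mark set (Latin plus a few Arabic marks) is an assumption.
    """
    pieces = re.split("[" + re.escape(marks) + "]", text)
    return [len(p.strip()) for p in pieces if p.strip()]


def length_score(len_src, len_tgt, c=1.0, s2=6.8):
    """Cost of pairing a source piece with a target piece, based on length only.

    delta is the scaled length difference; the cost is the negative log of the
    two-tailed probability of seeing a difference at least this large under a
    standard normal distribution. Lower cost means a more plausible pairing.
    c and s2 are assumed values, not parameters reported in this paper.
    """
    delta = (len_tgt - len_src * c) / math.sqrt(max(len_src, 1) * s2)
    prob = math.erfc(abs(delta) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|delta|))
    return -math.log(max(prob, 1e-300))  # clamp to avoid log(0)


if __name__ == "__main__":
    src = segment_lengths("Hello, world. Bye")
    print(src)
    print(length_score(100, 100), length_score(100, 150))
```

In a full aligner, a score of this kind would typically serve as the local cost inside a dynamic-programming search over candidate pairings; that search is omitted here.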
