Aligning Parallel Bilingual Corpora Statistically with Punctuation Criteria

We present a new approach to aligning sentences in bilingual parallel corpora based on punctuation, especially for English and Chinese. Although the length-based approach achieves high sentence alignment accuracy for clean parallel corpora written in two Western languages, such as French-English or German-English, it does not work as well for parallel corpora that are noisy or written in two disparate languages, such as Chinese-English. Cognates can be used on top of the length-based approach to increase alignment accuracy. However, cognates do not exist between two disparate languages, which limits the applicability of the cognate-based approach. In this paper, we examine the feasibility of exploiting the statistically ordered matching of punctuation marks in two languages to achieve high-accuracy sentence alignment. We have experimented with an implementation of the proposed method on parallel corpora, namely the Chinese-English Sinorama Magazine Corpus and Scientific American Magazine articles, with satisfactory results. In our experiments, the proposed method achieves better precision rates than the length-based method. A highly promising improvement was observed when the punctuation-based and length-based methods were combined within a common statistical framework. We also demonstrate that the method can be applied to other language pairs, such as English-Japanese, with minimal additional effort.
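To make the idea of ordered punctuation matching concrete, the following is a minimal sketch, not the statistical model used in the paper. It assumes a small hand-written mapping from Chinese to English punctuation (`ZH_TO_EN`, hypothetical and incomplete) and substitutes a simple longest-common-subsequence score for the probabilistic matching described above. All function names are illustrative.

```python
# Hypothetical, incomplete mapping from Chinese punctuation to English
# counterparts; an assumption for illustration, not taken from the paper.
ZH_TO_EN = {
    "，": ",", "、": ",", "。": ".", "；": ";", "：": ":",
    "？": "?", "！": "!", "（": "(", "）": ")",
}
EN_PUNCT = set(",.;:?!()")


def punct_sequence(text, mapping=None):
    """Return the punctuation marks of text in order, normalized to English forms."""
    seq = []
    for ch in text:
        if mapping and ch in mapping:
            seq.append(mapping[ch])
        elif ch in EN_PUNCT:
            seq.append(ch)
    return seq


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def punct_match_score(en_text, zh_text):
    """Score in [0, 1]: share of punctuation marks matched in the same order."""
    a = punct_sequence(en_text)
    b = punct_sequence(zh_text, ZH_TO_EN)
    if not a and not b:
        return 1.0  # no punctuation on either side: nothing to contradict
    return 2 * lcs_length(a, b) / (len(a) + len(b))
```

A candidate sentence pair whose punctuation marks line up in the same order scores near 1, while a pair with unrelated punctuation scores near 0. In a full aligner, such a score would be one component of a dynamic-programming search over candidate pairings, possibly combined with a length-based score.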
