Chinese-Japanese Clause Alignment

Bi-text alignment is useful for many natural language processing tasks, such as machine translation, bilingual lexicography, and word sense disambiguation. This paper presents a method for aligning Chinese-Japanese bilingual texts at the clause level. After describing some characteristics of Chinese-Japanese bilingual texts, we investigate statistical properties of a Chinese-Japanese bilingual corpus, including a correlation test of text lengths between the two languages and a distribution test of the length-ratio data. We then focus on n-m alignment modes (n > 1 or m > 1), which are prone to mismatch, and propose a similarity measure based on Hanzi character information for these modes. Using dynamic programming, we combine the statistical information and the Hanzi character information to find the alignment with the least overall cost. Experiments show that our algorithm achieves good alignment accuracy.
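To make the combination of length statistics, Hanzi overlap, and dynamic programming concrete, the following is a minimal sketch and not the paper's actual formulation. The abstract gives no formulas or parameter values, so everything named here is an assumption for illustration: the Gaussian-style length penalty, the Dice coefficient over shared ideographs, the set of alignment modes, and the constants `RATIO_MEAN`, `RATIO_VAR`, `HANZI_WEIGHT`, and `MODE_PENALTY`. For simplicity the Hanzi term is applied to every mode with text on both sides, and characters are compared by code point, so simplified Chinese forms that differ from their Japanese kanji counterparts are not matched.

```python
import math
from collections import Counter

# All constants below are hypothetical; the abstract reports no values.
RATIO_MEAN = 1.0    # assumed mean of len(japanese) / len(chinese)
RATIO_VAR = 0.5     # assumed variance of the length difference per character
HANZI_WEIGHT = 2.0  # assumed weight of the Hanzi dissimilarity term
MODE_PENALTY = {    # assumed prior cost of each (n Chinese, m Japanese) mode
    (1, 1): 0.0, (1, 0): 4.0, (0, 1): 4.0,
    (2, 1): 1.5, (1, 2): 1.5, (2, 2): 3.0,
}

def is_hanzi(ch):
    """True for characters in the CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"

def hanzi_similarity(zh, ja):
    """Dice coefficient over the multisets of ideographs in both strings."""
    a = Counter(c for c in zh if is_hanzi(c))
    b = Counter(c for c in ja if is_hanzi(c))
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    shared = sum((a & b).values())
    return 2.0 * shared / total

def length_cost(len_zh, len_ja):
    """Squared, variance-normalised deviation from the expected length."""
    if len_zh == 0 and len_ja == 0:
        return 0.0
    mean = (len_zh + len_ja / RATIO_MEAN) / 2.0
    delta = (len_ja - len_zh * RATIO_MEAN) / math.sqrt(mean * RATIO_VAR)
    return delta * delta / 2.0

def align(zh_clauses, ja_clauses):
    """Return the least-cost list of (zh_indices, ja_indices) groups."""
    n, m = len(zh_clauses), len(ja_clauses)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in MODE_PENALTY.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                zh = "".join(zh_clauses[i:ni])
                ja = "".join(ja_clauses[j:nj])
                c = cost[i][j] + penalty + length_cost(len(zh), len(ja))
                if di and dj:
                    # Hanzi term only when both sides contain text.
                    c += HANZI_WEIGHT * (1.0 - hanzi_similarity(zh, ja))
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the alignment.
    path = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        path.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    path.reverse()
    return path
```

As a usage example under these assumed settings, `align(["東京", "大学", "音楽"], ["東京大学", "音楽"])` groups the first two Chinese clauses with the first Japanese clause (a 2-1 mode) and pairs the remaining clauses 1-1. In practice the length parameters would be estimated from the corpus statistics the paper investigates rather than fixed by hand.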
