Improving Word Alignment by Adjusting Chinese Word Segmentation

Most current approaches to Chinese word alignment first apply a word segmentation system to identify words. However, word boundaries often do not correspond across languages, and this mismatch degrades word alignment performance. In this paper, we propose two unsupervised methods that adjust the word segmentation so that as many tokens as possible map one-to-one between corresponding sentences. The first method learns affix rules from a bilingual terminology bank. The second method uses an impurity measure motivated by decision trees. Our experiments show that both adjustment methods significantly improve word alignment performance.
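
To give the idea of an impurity measure a concrete shape, the sketch below computes Gini impurity, one of the standard impurity measures used in decision tree learning, over a table of counts. The abstract does not say which impurity measure the paper uses or what it is computed over, so the function name `gini_impurity`, the choice of Gini, and the example counts (how often a hypothetical Chinese token is linked to each English word) are illustrative assumptions rather than the authors' formulation.

```python
from collections import Counter


def gini_impurity(counts):
    """Return the Gini impurity 1 - sum(p_i^2) of a count table.

    `counts` maps each outcome (here, hypothetically, an English word
    aligned to one Chinese token) to how often it was observed.
    0.0 means every observation is the same outcome; larger values mean
    the observations are spread over more outcomes.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no observations: treat as pure by convention
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


# Hypothetical alignment counts, invented for illustration only.
consistent = Counter({"bank": 10})
mixed = Counter({"bank": 5, "shore": 5})

print(gini_impurity(consistent))
print(gini_impurity(mixed))
```

Under this assumed reading, a token whose links concentrate on a single target word has low impurity, while a token whose links are spread across several target words has high impurity and would be a candidate for a segmentation adjustment.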
