Building Monolingual Word Alignment Corpus for the Greater China Region

A single meaning is often expressed differently across the Mainland China, Hong Kong, and Taiwan varieties of Mandarin Chinese, collectively referred to as the Greater China Region (GCR). Unlike existing word alignment corpora, which are bilingual, the two corpora we construct in this paper are monolingual GCR corpora. The first is a GCR word dictionary corpus of 11,623 triples, automatically extracted from 30 million Wikipedia sentence pairs and then manually annotated. The second is a manually annotated GCR word alignment corpus of 12,000 sentence pairs drawn from Wikipedia and news websites. In addition, we present a rule-based word alignment model that systematically explores the different word alignment cases, e.g., 1-1, 1-n, and m-n mappings, from Mainland China to Hong Kong or Taiwan. Evaluation results on our two GCR word alignment corpora verify the effectiveness of our model, which significantly outperforms the current Hidden Markov Model (HMM) based method, GIZA++, and their enhanced versions.
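The abstract does not spell out the rules the model uses, so the following is only a minimal sketch of how 1-1, 1-n, and m-n mappings could be handled, assuming a greedy longest-match lookup in a phrase lexicon with an identity fallback for unchanged tokens. The function names, the toy lexicon, and the placeholder tokens (`w1`, `x1`, and so on) are all hypothetical and are not taken from the paper or its corpora.

```python
from typing import Dict, List, Tuple

Phrase = Tuple[str, ...]
Link = Tuple[Tuple[int, ...], Tuple[int, ...]]

# Hypothetical toy lexicon: Mainland phrase -> Taiwan phrase (illustrative only).
TOY_LEXICON: Dict[Phrase, Phrase] = {
    ("软件",): ("軟體",),                  # 1-1 mapping
    ("w1",): ("x1", "x2"),                # 1-n mapping (placeholder tokens)
    ("w2", "w3"): ("x3", "x4", "x5"),     # m-n mapping (placeholder tokens)
}


def find_span(tokens: List[str], phrase: Phrase, used: List[bool]) -> int:
    """Return the start of the first unused occurrence of phrase, or -1."""
    n = len(phrase)
    for start in range(len(tokens) - n + 1):
        if tuple(tokens[start:start + n]) == phrase and not any(used[start:start + n]):
            return start
    return -1


def align(src: List[str], tgt: List[str], lexicon: Dict[Phrase, Phrase]) -> List[Link]:
    """Greedily link source spans to target spans; returns index-tuple pairs."""
    max_len = max((len(key) for key in lexicon), default=1)
    used = [False] * len(tgt)
    links: List[Link] = []
    i = 0
    while i < len(src):
        matched = False
        # Prefer the longest source span so m-n rules win over 1-1 rules.
        for m in range(min(max_len, len(src) - i), 0, -1):
            span = tuple(src[i:i + m])
            candidates: List[Phrase] = []
            if span in lexicon:
                candidates.append(lexicon[span])
            if m == 1:
                candidates.append(span)  # identity fallback for shared tokens
            for target in candidates:
                start = find_span(tgt, target, used)
                if start < 0:
                    continue
                for j in range(start, start + len(target)):
                    used[j] = True
                links.append((tuple(range(i, i + m)),
                              tuple(range(start, start + len(target)))))
                matched = True
                break
            if matched:
                i += m
                break
        if not matched:
            i += 1  # leave the source token unaligned
    return links
```

Calling `align(src, tgt, TOY_LEXICON)` on two token lists returns pairs of index tuples, one pair per link, so a 1-n link has one source index and several target indices. Source tokens that match no rule and have no identical counterpart are simply left unaligned.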
