Linguistic-Relationships-Based Approach for Improving Word Alignment

Unsupervised word aligners such as GIZA++ are widely used in phrase-based statistical machine translation. The quality of the resulting alignment model depends on the size and quality of the bilingual corpus. For low-resource language pairs such as Chinese and Vietnamese, however, unsupervised word alignment is sometimes of low quality because the data are sparse. In addition, such a model does not take advantage of linguistic relationships between the two languages to improve word alignment performance. Chinese and Vietnamese are of the same language type and have close linguistic relationships. In this article, we integrate characteristics of these linguistic relationships into the word alignment model to enhance the quality of Chinese-Vietnamese word alignment. The relationships we use are Sino-Vietnamese words and content words. Experimental results show that our method improves both word alignment performance and machine translation quality.
