Chinese-Japanese Machine Translation Exploiting Chinese Characters

The Chinese and Japanese languages share Chinese characters. Because the Chinese characters used in Japanese originated in ancient China, the two languages have many Chinese characters in common. Chinese characters carry significant semantic information, and common Chinese characters share the same meaning in the two languages, so they can be quite useful in Chinese-Japanese machine translation (MT). We therefore propose a method for creating a Chinese character mapping table for Japanese, traditional Chinese, and simplified Chinese, with the aim of constructing a complete resource of common Chinese characters. Furthermore, we point out two main problems in Chinese word segmentation for Chinese-Japanese MT, namely unknown words and word segmentation granularity, and propose an approach that exploits common Chinese characters to solve them. We also propose a statistical method for detecting semantically equivalent Chinese characters other than the common ones, and a method for exploiting shared Chinese characters in phrase alignment. Experiments carried out on a state-of-the-art phrase-based statistical MT system and an example-based MT system show that the proposed approaches can improve MT performance significantly, thereby verifying the effectiveness of shared Chinese characters for Chinese-Japanese MT.
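As a rough illustration of how a character mapping table could be used to find common Chinese characters, the following minimal sketch converts Japanese kanji to simplified Chinese and lists the characters a Japanese word then shares with a Chinese word. The three table entries and the names `MAPPING_TABLE`, `to_simplified`, and `shared_characters` are assumptions made for this example; they are not taken from the proposed resource or method.

```python
# Toy mapping table: (Japanese kanji, traditional Chinese, simplified Chinese).
# The entries are illustrative only and are not drawn from the actual resource.
MAPPING_TABLE = [
    ("機", "機", "机"),
    ("国", "國", "国"),
    ("発", "發", "发"),
]

# Hypothetical lookup from Japanese kanji to simplified Chinese.
JA_TO_ZH = {ja: zh for ja, _trad, zh in MAPPING_TABLE}


def to_simplified(japanese_word):
    """Map each kanji to its simplified form; unmapped characters are kept as-is."""
    return "".join(JA_TO_ZH.get(ch, ch) for ch in japanese_word)


def shared_characters(japanese_word, chinese_word):
    """Return the characters of the converted Japanese word found in the Chinese word."""
    converted = to_simplified(japanese_word)
    return [ch for ch in converted if ch in chinese_word]


if __name__ == "__main__":
    print(to_simplified("機会"))
    print(shared_characters("機会", "机会"))
```

Characters missing from the table pass through unchanged, so a character written identically in both languages still counts as shared without needing an entry of its own.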
