An Ensemble Model of Word-based and Character-based Models for Japanese and Chinese Input Method

Japanese and Chinese have far more characters than can be typed directly on a standard keyboard, so users need input methods that convert keyed-in phonetic strings into the intended characters. Input methods based on statistical models have recently become popular because of their accuracy and ease of maintenance. Most of them adopt word-based models, because the models are trained on word-segmented corpora. However, word-based models handle unknown words poorly: they cannot correctly convert words that do not appear in the training corpora. To address this problem, we propose a character-based model that lets an input method convert unknown words by exploiting character-aligned corpora generated automatically by a monotonic alignment tool. In addition to the character-based model, we propose an ensemble model that combines the character-based and word-based models by linear interpolation to achieve higher accuracy. All of these models are based on the joint source-channel model, which exploits rich context through higher-order joint n-grams. Experiments on Japanese and Chinese datasets showed that the character-based model performs reasonably well and that the ensemble model outperforms the word-based baseline. As future work, the effectiveness of incorporating large amounts of raw data should be investigated.
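
The linear interpolation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say at which level the two models are interpolated or how the weight is chosen, so everything here is an assumption for illustration: the function names `interpolate` and `best_candidate`, the weight `lam`, and the toy probabilities are hypothetical and are not taken from the paper.

```python
def interpolate(p_word: float, p_char: float, lam: float = 0.5) -> float:
    """Linearly interpolate a word-based and a character-based probability.

    lam is a hypothetical weight on the word-based model; (1 - lam) goes
    to the character-based model.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    return lam * p_word + (1.0 - lam) * p_char


def best_candidate(candidates: dict, lam: float = 0.5) -> str:
    """Return the conversion candidate with the highest interpolated score.

    candidates maps each candidate string to a (p_word, p_char) pair.
    """
    return max(candidates, key=lambda c: interpolate(*candidates[c], lam))


# Toy scores (invented): "known" is covered by the word-based model,
# while "unknown" is out of vocabulary, so only the character-based
# model assigns it any probability.
toy = {
    "known": (0.6, 0.2),
    "unknown": (0.0, 0.9),
}

choice_balanced = best_candidate(toy, lam=0.5)    # equal weight on both models
choice_word_heavy = best_candidate(toy, lam=0.9)  # mostly word-based
```

With equal weights, the character-based score is enough for the out-of-vocabulary candidate to win; with a weight close to 1 the ensemble behaves almost like the word-based model alone. In practice such a weight would typically be tuned on held-out data, which is again an assumption rather than something the abstract states.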
