Paper Information - Statistical Input Method based on a Phrase Class n-gram Model

We propose a method for constructing a phrase class n-gram model for Kana-Kanji conversion by combining phrase-based and class-based methods. The basic prediction unit of the language model is a word-pronunciation pair. We compared a phrase class bi-gram model constructed by our method with a word-pronunciation pair tri-gram model in terms of conversion accuracy and model size. Conversion accuracy was measured by F measure, and model size by the vocabulary size and the number of non-zero frequency entries. The phrase class bi-gram model achieved an F measure of 90.41%, against 90.21% for the tri-gram model. Its vocabulary size and number of non-zero frequency entries were 5,550 and 206,978, respectively, against 22,801 and 645,996 for the tri-gram model. Our method thus yields a smaller and more accurate language model.
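The abstract does not state how the F measure is computed. One convention often used for Kana-Kanji conversion scores the converted text against the reference by their longest common subsequence (LCS) of characters: precision is the LCS length divided by the output length, recall is the LCS length divided by the reference length, and the F measure is their harmonic mean. The sketch below assumes that convention; the function names `lcs_length` and `f_measure` are hypothetical and are not taken from the paper.

```python
def lcs_length(a: str, b: str) -> int:
    """Return the length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the best subsequence that ended before both characters.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise drop one character from either string.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def f_measure(reference: str, output: str) -> float:
    """Character-level F measure under the assumed LCS-based definition."""
    lcs = lcs_length(reference, output)
    if lcs == 0:
        return 0.0
    precision = lcs / len(output)    # share of output characters that are correct
    recall = lcs / len(reference)    # share of reference characters that are recovered
    return 2 * precision * recall / (precision + recall)
```

Under this definition an output identical to the reference scores 1.0, and every wrong, missing, or extra character lowers precision, recall, or both.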

Shinsuke Mori | Hirokuni Maeta
