Japanese language model based on bigrams and its application to on-line character recognition

Abstract. This paper presents a postprocessing method based on the n-gram approach for Japanese character recognition. Ordinary Japanese sentences are written with a small number of phonetic characters (Kana) and thousands of ideographic Kanji characters. Japanese text therefore not only has a large character set but also mixes characters with very different entropies, which makes it difficult to apply conventional n-gram methodologies to postprocessing in Japanese character recognition. To resolve these two difficulties, we propose a method that uses parts of speech in two ways. The first reduces the number of Kanji characters by clustering them according to the parts of speech in which each Kanji character is used. The second increases the entropy of a Kana character by classifying it into more detailed subcategories with part-of-speech attributes. We applied a bigram approach based on these two techniques to a Japanese language model. Experiments yielded two results: (1) when trained on Japanese newspaper texts containing a total of approximately three million characters, our language model resolved the imbalance between Kana and Kanji characters and reduced the perplexity of Japanese to less than 100; and (2) postprocessing with the model for on-line character recognition rectified about half of all substitution errors in cases where the correct characters were among the candidates.
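As an illustration only, the two quantities the abstract reports (perplexity of a bigram model, and correction of substitution errors by choosing among recognizer candidates) can be sketched as follows. This is a minimal sketch, not the paper's implementation: the function names `train_bigram`, `perplexity`, and `rescore` are hypothetical, add-alpha smoothing is an assumption made here for simplicity, and the tokens are raw symbols rather than the part-of-speech-based Kanji clusters and Kana subcategories the paper proposes. The candidate search is a standard Viterbi pass that combines an assumed per-candidate recognizer score with the bigram probability.

```python
import math
from collections import Counter


def train_bigram(tokens, alpha=1.0):
    """Return prob(prev, cur), an add-alpha smoothed bigram estimate.

    Assumption: add-alpha smoothing over the observed vocabulary; the
    paper's actual estimation method is not stated in the abstract.
    """
    vocab_size = len(set(tokens))
    context_counts = Counter(tokens[:-1])          # counts of each token as a left context
    bigram_counts = Counter(zip(tokens, tokens[1:]))

    def prob(prev, cur):
        return (bigram_counts[(prev, cur)] + alpha) / (
            context_counts[prev] + alpha * vocab_size
        )

    return prob


def perplexity(prob, tokens):
    """Per-token perplexity of a sequence under the bigram model."""
    log_sum = sum(math.log2(prob(p, c)) for p, c in zip(tokens, tokens[1:]))
    return 2 ** (-log_sum / (len(tokens) - 1))


def rescore(prob, candidates):
    """Pick the best token sequence from per-position candidate lists.

    candidates: list of lists of (token, recognizer_score) pairs, with
    scores in (0, 1]. The score format is hypothetical.
    """
    # best[token] = (log score of best path ending in token, that path)
    best = {tok: (math.log(s), [tok]) for tok, s in candidates[0]}
    for column in candidates[1:]:
        new_best = {}
        for tok, s in column:
            score, path = max(
                (
                    (prev_score + math.log(prob(prev, tok)), prev_path)
                    for prev, (prev_score, prev_path) in best.items()
                ),
                key=lambda item: item[0],
            )
            new_best[tok] = (score + math.log(s), path + [tok])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Toy usage: the recognizer prefers "a" at the middle position, but the
# bigram statistics of the training text favor "b" after "a".
model = train_bigram(list("abababab"))
corrected = rescore(model, [[("a", 0.9)], [("a", 0.6), ("b", 0.4)], [("a", 0.9)]])
```

In the toy usage, the recognizer's top candidate at the middle position is overridden because the path through "b" has a higher combined score. In the setting the abstract describes, the tokens fed to such a model would be the part-of-speech-based classes rather than raw characters, which is what keeps the model size manageable for thousands of Kanji.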