Paper information - Japanese language model based on bigrams and its application to on-line character recognition

Abstract: This paper deals with a postprocessing method based on the n-gram approach for Japanese character recognition. Ordinary Japanese sentences are written with a small number of phonetic characters (Kana) and thousands of Kanji characters, which are ideographs. In other words, Japanese text not only has a large character set but also mixes characters with different entropies. It is therefore difficult to apply conventional n-gram methodologies to postprocessing in Japanese character recognition. To resolve these two difficulties, we propose a method that uses parts of speech in two ways. The first is to reduce the number of Kanji characters by clustering them according to the parts of speech in which each Kanji character is used. The second is to increase the entropy of a Kana character by classifying it into more detailed subcategories with part-of-speech attributes. We applied a bigram approach based on these two techniques to a Japanese language model. Experiments yielded two results: (1) our language model resolved the imbalance between Kana and Kanji characters and reduced the perplexity of Japanese to less than 100 when the model was trained on Japanese newspaper texts (approximately three million characters in total), and (2) postprocessing with the model for on-line character recognition corrected about half of all substitution errors when the correct characters were among the candidates.
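The bigram scoring, perplexity measurement, and candidate postprocessing mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the add-one smoothing, the `<s>` start symbol, the toy training strings, and the recognizer scores are all assumptions made for illustration. The sketch also works on raw characters, whereas the paper's model works on part-of-speech-based Kanji clusters and Kana subcategories; such class labels could be passed in as tokens instead of characters.

```python
import math
from collections import Counter

BOS = "<s>"  # hypothetical sentence-start symbol (an assumption)


def train_bigram(sentences):
    """Count contexts and bigrams over token sequences (strings or lists)."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for tokens in sentences:
        prev = BOS
        for tok in tokens:
            contexts[prev] += 1
            bigrams[(prev, tok)] += 1
            vocab.add(tok)
            prev = tok
    return contexts, bigrams, vocab


def log_prob(model, prev, tok):
    """log P(tok | prev) with add-one smoothing (assumed, not from the paper)."""
    contexts, bigrams, vocab = model
    return math.log((bigrams[(prev, tok)] + 1) / (contexts[prev] + len(vocab)))


def perplexity(model, sentences):
    """exp of the average negative log-probability per token."""
    total, count = 0.0, 0
    for tokens in sentences:
        prev = BOS
        for tok in tokens:
            total += log_prob(model, prev, tok)
            count += 1
            prev = tok
    return math.exp(-total / count)


def rescore(model, candidates):
    """Viterbi search over per-position recognizer candidates.

    candidates: one list per character position, each holding
    (token, recognizer_log_score) pairs. Returns the best-scoring string.
    """
    best = {BOS: (0.0, [])}  # last token -> (score, path so far)
    for position in candidates:
        nxt = {}
        for tok, rec_score in position:
            score, path = max(
                ((s + log_prob(model, prev, tok) + rec_score, p)
                 for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            nxt[tok] = (score, path + [tok])
        best = nxt
    return "".join(max(best.values(), key=lambda item: item[0])[1])


# Toy data, invented for illustration only.
model = train_bigram(["日本語", "日本人", "日本語"])
candidates = [
    [("目", -0.1), ("日", -0.5)],
    [("木", -0.1), ("本", -0.2)],
    [("話", -0.2), ("語", -0.3)],
]
print(rescore(model, candidates))
print(perplexity(model, ["日本語"]))
```

With this toy data the recognizer's top-ranked choices spell 目木話, but the bigram scores are strong enough for the search to select 日本語 from the candidate lists. That is the kind of substitution error the abstract describes correcting when the correct character is among the candidates.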

NOBUYASU ITOH
