OCR in Bangla: an Indo-Bangladeshi language

This paper describes a complete OCR system for documents printed in a single Bangla (Bengali) font. Character shapes are recognized by a combination of template matching and feature matching. Images digitized by a flatbed scanner are subjected to skew correction; line, word and character segmentation; separation of simple and compound characters; feature extraction; and finally character recognition. A feature-based tree classifier is used for simple character recognition. Preprocessing steps such as thinning and skeletonization are not necessary in our scheme, and hence the system is quite fast. At present, the system has an accuracy of about 96%. In addition, some character occurrence statistics have been computed in order to model an error detection and correction technique in the near future.
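As an illustration of the line segmentation stage, the following is a minimal sketch that assumes a horizontal projection profile is used to find text lines in a skew-corrected binary image. The abstract does not state which segmentation method the system uses, so the technique, the function name `segment_lines` and the toy `page` image are assumptions made for illustration only.

```python
def segment_lines(image):
    """Return (start, end) row ranges, end exclusive, of text lines.

    `image` is a binary image given as a list of rows, where 1 is an
    ink pixel and 0 is background. Assumption: the page is already
    skew-corrected, so text lines are separated by blank rows.
    """
    # Horizontal projection profile: number of ink pixels in each row.
    profile = [sum(row) for row in image]

    lines = []
    start = None  # first row of the line currently being scanned
    for y, count in enumerate(profile):
        if count > 0 and start is None:
            start = y                 # entering a text line
        elif count == 0 and start is not None:
            lines.append((start, y))  # leaving a text line
            start = None
    if start is not None:
        # The last line runs to the bottom edge of the image.
        lines.append((start, len(profile)))
    return lines


# Hypothetical 5-row page: two text lines separated by one blank row.
page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
]
print(segment_lines(page))  # [(1, 3), (4, 5)]
```

The same idea, applied to column sums within a single line, could in principle be used to split words, although connected scripts such as Bangla typically need additional handling for character segmentation.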
