A complete printed Bangla OCR system

A complete Optical Character Recognition (OCR) system for printed Bangla, the fourth most popular script in the world, is presented. This is the first OCR system among all script forms used in the Indian sub-continent. The problem is difficult because (i) there are about 300 basic, modified and compound character shapes in the script, (ii) the characters in a word are topologically connected, and (iii) Bangla is an inflectional language. In our system, the document image captured by a flatbed scanner is subjected to skew correction, text-graphics separation, line segmentation, zone detection, and word and character segmentation, using some conventional and some newly developed techniques. From zonal information and shape characteristics, the basic, modified and compound characters are separated for the convenience of classification. The basic and modified characters, which are about 75 in number and which occupy about 96% of the text corpus, are recognized by a structural-feature-based tree classifier. The compound characters are recognized by a tree classifier followed by a template-matching approach. The feature detection is simple and robust, and preprocessing steps such as thinning and pruning are avoided. Character unigram statistics are used to make the tree classifier efficient. Several heuristics are also used to speed up the template-matching approach. A dictionary-based error-correction scheme has been used, in which separate dictionaries are compiled for root words and suffixes and also contain morpho-syntactic information. For single-font, clear documents, 95.50% word-level recognition accuracy (equivalent to 99.10% character-level accuracy) has been obtained. Extension of the work to Devnagari, the third most popular script in the world, is also discussed.
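
The stated equivalence between word-level and character-level accuracy can be sanity-checked with a short calculation. The sketch below rests on a simplifying assumption that is not taken from the paper: character errors are independent, and a word counts as correct only if every character in it is correct, so that word accuracy equals character accuracy raised to the word length. The function names are hypothetical and chosen only for illustration.

```python
import math


def word_accuracy(char_acc: float, word_length: float) -> float:
    """Word accuracy under the ASSUMED independence model:
    a word is correct only if each of its characters is correct."""
    return char_acc ** word_length


def implied_word_length(word_acc: float, char_acc: float) -> float:
    """Average word length (in characters) at which the two reported
    accuracies agree under the same assumed model."""
    return math.log(word_acc) / math.log(char_acc)


# Figures reported in the abstract.
reported_word_acc = 0.9550
reported_char_acc = 0.9910

avg_len = implied_word_length(reported_word_acc, reported_char_acc)
print(f"Implied average word length: {avg_len:.2f} characters")
```

Under this assumed model the two reported figures agree for words of about five characters on average. The abstract does not say how the equivalence was actually computed, so this is only a plausibility check.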
