An OCR system to read two Indian language scripts: Bangla and Devnagari (Hindi)

An OCR system is proposed that can read two Indian language scripts, Bangla and Devnagari (Hindi), the most popular scripts in the Indian subcontinent. Because both scripts derive from the ancient Brahmi script, they share many features, so a single system can be modeled to recognize both. In the proposed model, the same set of algorithms handles document digitization, skew detection, text line segmentation and zone separation, word and character segmentation, and the grouping of characters into basic, modifier, and compound categories for both scripts. The feature sets, the classification tree, and the knowledge base required for error correction (such as the lexicon) differ between Bangla and Devnagari. The system performs well on single-font text printed on clear documents.
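The abstract does not say how text line segmentation and zone separation are carried out, so the following is only a minimal sketch of one common approach, not the authors' method. It assumes a deskewed binary page given as a list of rows (1 for ink, 0 for background), splits it into text lines with a horizontal projection profile, and then takes the row with the most ink in each line as the headline that runs along the top of words in both Bangla and Devnagari. The function names `segment_lines` and `find_headline` are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumed format: rows of 0 (background) / 1 (ink)


def segment_lines(image: Image) -> List[Tuple[int, int]]:
    """Return (start_row, end_row) pairs, end exclusive, one per text line.

    A text line is taken to be a maximal run of rows that contain ink.
    """
    profile = [sum(row) for row in image]  # ink pixels per row
    lines = []
    start = None
    for y, count in enumerate(profile):
        if count > 0 and start is None:
            start = y                      # first inked row of a new line
        elif count == 0 and start is not None:
            lines.append((start, y))       # blank row closes the line
            start = None
    if start is not None:                  # line touching the bottom edge
        lines.append((start, len(profile)))
    return lines


def find_headline(image: Image, line: Tuple[int, int]) -> int:
    """Return the row inside `line` with the most ink (assumed headline)."""
    top, bottom = line
    return max(range(top, bottom), key=lambda y: sum(image[y]))


if __name__ == "__main__":
    page = [
        [0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1],  # headline of the first line
        [0, 1, 0, 1, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [1, 1, 1, 1, 0],  # headline of the second line
        [1, 0, 0, 1, 0],
    ]
    for line in segment_lines(page):
        print(line, find_headline(page, line))
```

Under these assumptions, the rows above the headline would form the upper zone of a text line and the rows below it the middle and lower zones. A real page would also need noise tolerance, for example a small threshold on the row counts instead of a strict comparison with zero.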
