An optical character recognition system for printed Telugu text

Telugu is one of the oldest and popular languages of India, spoken by more than 66 million people, especially in South India. Not much work has been reported on the development of optical character recognition (OCR) systems for Telugu text. Therefore, it is an area of current research. Some characters in Telugu are made up of more than one connected symbol. Compound characters are written by associating modifiers with consonants, resulting in a huge number of possible combinations, running into hundreds of thousands. A compound character may contain one or more connected symbols. Therefore, systems developed for documents of other scripts, like Roman, cannot be used directly for the Telugu language.The individual connected portions of a character or a compound character are defined as basic symbols in this paper and treated as a unit of recognition. The algorithms designed exploit special characteristics of Telugu script for processing the document images efficiently. The algorithms have been implemented to create a Telugu OCR system for printed text (TOSP). The output of TOSP is in phonetic English that can be transliterated to generate editable Telugu text. A special feature of TOSP is that it is designed to handle a large variety of sizes and multiple fonts, and still provides raw OCR accuracy of nearly 98%. The phonetic English representation can be also used to develop a Telugu text-to-speech system; work is in progress in this regard.

[1]  Bidyut Baran Chaudhuri,et al.  Segmentation of Bangla handwritten text into characters by recursive contour following , 1999, Proceedings of the Fifth International Conference on Document Analysis and Recognition. ICDAR '99 (Cat. No.PR00318).

[2]  Hong Yan,et al.  Skew Correction of Document Images Using Interline Cross-Correlation , 1993, CVGIP Graph. Model. Image Process..

[3]  Bidyut Baran Chaudhuri,et al.  A complete printed Bangla OCR system , 1998, Pattern Recognit..

[4]  Ali Zilouchian,et al.  FUNDAMENTALS OF NEURAL NETWORKS , 2001 .

[5]  Bidyut Baran Chaudhuri,et al.  Script line separation from Indian multi-script documents , 1999, Proceedings of the Fifth International Conference on Document Analysis and Recognition. ICDAR '99 (Cat. No.PR00318).

[6]  K. GovindanV.,et al.  Character recognitiona review , 1990 .

[7]  Harry Wechsler,et al.  Automated page orientation and skew angle detection for binary document images , 1994, Pattern Recognit..

[8]  Veena Bansal,et al.  On how to describe shapes of Devanagari characters and use them for recognition , 1999, Proceedings of the Fifth International Conference on Document Analysis and Recognition. ICDAR '99 (Cat. No.PR00318).

[9]  Eric Lecolinet,et al.  A Survey of Methods and Strategies in Character Segmentation , 1996, IEEE Trans. Pattern Anal. Mach. Intell..

[10]  V. K. Govindan,et al.  Character recognition - A review , 1990, Pattern Recognit..

[11]  Laurene V. Fausett,et al.  Fundamentals Of Neural Networks , 1994 .

[12]  Atul Negi,et al.  An OCR system for Telugu , 2001, Proceedings of Sixth International Conference on Document Analysis and Recognition.

[13]  Norihiro Hagita,et al.  Automated entry system for printed documents , 1990, Pattern Recognit..

[14]  Sameer Antani,et al.  Gujarati character recognition , 1999, Proceedings of the Fifth International Conference on Document Analysis and Recognition. ICDAR '99 (Cat. No.PR00318).

[15]  Jiangying Zhou,et al.  Page segmentation and classification , 1992, CVGIP Graph. Model. Image Process..

[16]  M. B. Sukhaswami,et al.  Recognition of telugu characters using neural networks , 1995, Int. J. Neural Syst..

[17]  George Nagy,et al.  Twenty Years of Document Image Analysis in PAMI , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[18]  S. Sathiya Keerthi,et al.  A study of representations for pen based handwriting recognition of Tamil characters , 1999, Proceedings of the Fifth International Conference on Document Analysis and Recognition. ICDAR '99 (Cat. No.PR00318).

[19]  Milan Sonka,et al.  Image Processing, Analysis and Machine Vision , 1993, Springer US.

[20]  Milan Sonka,et al.  Image processing analysis and machine vision [2nd ed.] , 1999 .

[21]  Sargur N. Srihari,et al.  Gradient-based contour encoding for character recognition , 1996, Pattern Recognit..

[22]  Ching Y. Suen,et al.  Historical review of OCR research and development , 1992, Proc. IEEE.