Localization, extraction and recognition of text in Telugu document images

In this paper we present a system to locate, extract and recognize Telugu text. The circular nature of Telugu script is exploited for segmenting text regions using the Hough Transform. First, the Hough Transform for circles is performed on the Sobel gradient magnitude of the image to locate text. The located circles are filled to yield text regions, followed by Recursive XY Cuts to segment the regions into paragraph, line and word regions. A bottom-up region merging process then envelops individual words. Local binarization of the word minimum bounding rectangles (MBRs) yields connected components containing glyphs for recognition. The recognition process first identifies candidate characters by a zoning technique and then constructs structural feature vectors by cavity analysis. Finally, if required, crossing-count-based non-linear normalization and scaling are performed before template matching. The segmentation process succeeds in extracting text from images with complex non-Manhattan layouts. The recognition process achieved a character recognition accuracy of 97%-98%.
