Generalization of Hindi OCR Using Adaptive Segmentation and Font Files

In this chapter, we describe an adaptive Indic OCR system implemented as part of a rapidly retargetable language tool effort, extending work found in [20, 2]. The system comprises script identification, character segmentation, training sample creation, and character recognition. For script identification, Hindi words are identified in bilingual or multilingual document images using features of the Devanagari script and a support vector machine (SVM). Identified words are then segmented into individual characters by a font-model-based intelligent character segmentation and recognition system. Using characteristics of structurally similar TrueType fonts, our system automatically builds a model for the segmentation and recognition of the new script, independent of glyph composition. The key is a reliance on known font attributes. In our recognition system, three feature extraction methods are used to demonstrate the importance of appropriate features for classification. The methods are tested on both Latin and non-Latin scripts. Results show that character-level recognition accuracy exceeds 92% for non-Latin and 96% for Latin text on degraded documents. This work is a step toward the recognition of scripts of low-density languages, which typically do not warrant the development of commercial OCR yet often have complete TrueType font descriptions.
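To make "features of the Devanagari script" more concrete, the sketch below computes one plausible feature of that kind: the strength of the headline (shirorekha), the horizontal stroke that joins the characters of a Devanagari word. This is an illustrative assumption, not the feature set used in the chapter; the names `horizontal_profile` and `headline_feature` and the toy bitmaps are hypothetical. In a real system, values like these might form part of the feature vector passed to the SVM.

```python
from typing import List, Tuple

Bitmap = List[List[int]]  # binarized word image: 1 = ink, 0 = background


def horizontal_profile(word: Bitmap) -> List[int]:
    """Count ink pixels in each row of a binarized word image."""
    return [sum(row) for row in word]


def headline_feature(word: Bitmap) -> Tuple[int, float]:
    """Return (index of the densest row, fraction of the word width it covers).

    A Devanagari word typically has one row band that spans nearly the
    whole width (the headline); Latin words usually do not.
    """
    profile = horizontal_profile(word)
    width = len(word[0])
    # max() returns the first index when several rows tie.
    row = max(range(len(profile)), key=profile.__getitem__)
    return row, profile[row] / width


# Toy 4x8 bitmaps (hypothetical data, for illustration only).
devanagari_like = [
    [1, 1, 1, 1, 1, 1, 1, 1],  # headline across the full width
    [0, 1, 0, 0, 1, 0, 0, 1],
    [0, 1, 1, 0, 1, 0, 1, 1],
    [0, 0, 1, 0, 1, 0, 0, 1],
]
latin_like = [
    [0, 1, 0, 0, 0, 1, 0, 0],
    [1, 0, 1, 0, 1, 0, 1, 0],
    [1, 1, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1, 0, 0],
]

print(headline_feature(devanagari_like))  # (0, 1.0)
print(headline_feature(latin_like))       # (2, 0.625)
```

A coverage ratio near 1.0 in the upper part of the word suggests a headline; on degraded scans the ratio would be lower, which is one reason a trained classifier is preferable to a fixed threshold.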

[1]  Gösta H. Granlund et al. Fourier Preprocessing for Hand Print Character Recognition, 1972, IEEE Transactions on Computers.

[2]  Nei Kato et al. A Handwritten Character Recognition System Using Directional Element Feature and Asymmetric Mahalanobis Distance, 1999, IEEE Trans. Pattern Anal. Mach. Intell.

[3]  Stephen V. Rice et al. Software tools and test data for research and testing of page-reading OCR systems, 2005, IS&T/SPIE Electronic Imaging.

[4]  Veena Bansal et al. Integrating knowledge sources in Devanagari text recognition system, 2000, IEEE Trans. Syst. Man Cybern. Part A.

[5]  Sargur N. Srihari et al. On-Line and Off-Line Handwriting Recognition: A Comprehensive Survey, 2000, IEEE Trans. Pattern Anal. Mach. Intell.

[6]  Merle Knight et al. The Oxford Hindi-English Dictionary, 1999.

[7]  Sabri A. Mahmoud et al. Arabic character recognition using fourier descriptors and character contour encoding, 1994, Pattern Recognit.

[8]  Bidyut Baran Chaudhuri et al. An OCR system to read two Indian language scripts: Bangla and Devnagari (Hindi), 1997, Proceedings of the Fourth International Conference on Document Analysis and Recognition.

[9]  A. G. Ramakrishnan et al. Optimal Feature Extraction for Bilingual OCR, 2002, Document Analysis Systems.

[10]  Alireza Khotanzad et al. Invariant Image Recognition by Zernike Moments, 1990, IEEE Trans. Pattern Anal. Mach. Intell.

[11]  Lawrence O'Gorman et al. Document image analysis: A bibliography, 1992, Machine Vision and Applications.

[12]  Lawrence O'Gorman et al. The Document Spectrum for Page Layout Analysis, 1993, IEEE Trans. Pattern Anal. Mach. Intell.

[13]  David S. Doermann et al. Bootstrapping structured page segmentation, 2003, IS&T/SPIE Electronic Imaging.

[14]  David S. Doermann et al. Re-targetable OCR with Intelligent Character Segmentation, 2008, The Eighth IAPR International Workshop on Document Analysis Systems.

[15]  Bidyut Baran Chaudhuri et al. Skew Angle Detection of Digitized Indian Script Documents, 1997, IEEE Trans. Pattern Anal. Mach. Intell.

[16]  Veena Bansal. Integrating Knowledge Sources in Devanagari Text Recognition, 1999.

[17]  David S. Doermann et al. Gabor filter based multi-class classifier for scanned document images, 2003, Proceedings of the Seventh International Conference on Document Analysis and Recognition.

[18]  Henry S. Baird et al. Language-free layout analysis, 1993, Proceedings of 2nd International Conference on Document Analysis and Recognition (ICDAR '93).

[19]  David S. Doermann et al. Adaptive Hindi OCR using generalized Hausdorff image comparison, 2003, TALIP.

[20]  Claudio De Stefano et al. Handwritten Numeral Recognition by means of Evolutionary Algorithms, 1999, ICDAR.

[21]  Jonathan J. Hull. Document Image Skew Detection: Survey and Annotated Bibliography, 1996, DAS.

[22]  David S. Doermann et al. Word level script identification for scanned document images, 2003, IS&T/SPIE Electronic Imaging.

[23]  Jonathan J. Hull et al. Document Analysis Systems II - Second Workshop on Document Analysis Systems, DAS 1996, Malvern, PA, USA, October 14-16, 1996, Selected papers, 1998, Series in Machine Perception and Artificial Intelligence.

[24]  Veena Bansal et al. Segmentation of touching and fused Devanagari characters, 2002, Pattern Recognit.

[25]  Abdel Belaïd et al. Cross-learning in analytic word recognition without segmentation, 2002, International Journal on Document Analysis and Recognition.

[26]  Flávio Bortolozzi et al. The recognition of handwritten numeral strings using a two-stage HMM-based method, 2003, International Journal on Document Analysis and Recognition.

[27]  Bidyut Baran Chaudhuri et al. A Hybrid Scheme for Handprinted Numeral Recognition Based on a Self-Organizing Network and MLP Classifiers, 2002, Int. J. Pattern Recognit. Artif. Intell.

[28]  Eric Lecolinet et al. A Survey of Methods and Strategies in Character Segmentation, 1996, IEEE Trans. Pattern Anal. Mach. Intell.

[29]  M. Teague. Image analysis via the general theory of moments, 1980.

[30]  Zheru Chi et al. Handwritten numeral recognition using self-organizing maps and fuzzy rules, 1995, Pattern Recognit.

[31]  Philip A. Chou et al. Document Image Decoding Using Markov Source Models, 1994, IEEE Trans. Pattern Anal. Mach. Intell.

[32]  George Paschos et al. Effective Arabic Character Recognition Using Support Vector Machines, 2007.

[33]  M. Berthod et al. Automatic recognition of handprinted characters—The state of the art, 1980, Proceedings of the IEEE.

[34]  David S. Doermann et al. Adaptive OCR with limited user feedback, 2005, Eighth International Conference on Document Analysis and Recognition (ICDAR'05).

[35]  Tarek M. Sobh et al. Innovations and Advanced Techniques in Computer and Information Sciences and Engineering, 2007.