Re-targetable OCR with Intelligent Character Segmentation

We have developed a font-model-based system for intelligent character segmentation and recognition. Using the characteristics of structurally similar TrueType fonts, the system automatically builds a model for segmenting and recognizing a new script, independent of glyph composition. The key is its reliance on known font attributes. Three feature extraction methods are used to demonstrate the importance of choosing appropriate features for classification. The methods are tested on both Latin (English) and non-Latin (Khmer) scripts. Results show that character-level recognition accuracy exceeds 92% for Khmer and 96% for English on degraded documents. This work is a step toward recognizing the scripts of low-density languages, which typically do not warrant the development of commercial OCR yet often have complete TrueType font descriptions.
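The idea of letting known glyph shapes drive segmentation can be illustrated with a toy sketch. The abstract does not describe the actual model, features, or search procedure, so everything below is an assumption made for illustration: the tiny hand-written bitmaps in GLYPHS stand in for glyphs rendered from a TrueType font, the Hamming-distance cost stands in for a real feature-based classifier, and the function names are hypothetical. The sketch uses dynamic programming to choose the sequence of glyph templates that best explains the columns of a line image, so the cut points and the labels are found together.

```python
# Toy illustration only: not the authors' method.
# GLYPHS is a hypothetical "font model"; each glyph is a list of equal-height
# rows of '0'/'1', standing in for a bitmap rendered from a TrueType font.
GLYPHS = {
    "I": ["1", "1", "1"],
    "L": ["10", "10", "11"],
    "T": ["111", "010", "010"],
}


def columns(bitmap):
    """Transpose row strings into a list of column tuples of ints."""
    width = len(bitmap[0])
    return [tuple(int(row[c]) for row in bitmap) for c in range(width)]


def render(text, model):
    """Build a synthetic line image by concatenating glyph columns."""
    cols = []
    for ch in text:
        cols.extend(columns(model[ch]))
    return cols


def mismatch(window, template):
    """Hamming distance between two equal-width lists of columns."""
    return sum(a != b for wc, tc in zip(window, template) for a, b in zip(wc, tc))


def segment_and_recognize(line_cols, model):
    """Return (text, cut_positions, total_cost) for the cheapest explanation
    of the line as a left-to-right sequence of glyph templates."""
    templates = {ch: columns(bm) for ch, bm in model.items()}
    n = len(line_cols)
    inf = float("inf")
    best = [inf] * (n + 1)   # best[i]: minimum cost to explain columns [0, i)
    back = [None] * (n + 1)  # back[j]: (start column, label) of the last glyph
    best[0] = 0
    for i in range(n):
        if best[i] == inf:
            continue
        for ch, tpl in templates.items():
            j = i + len(tpl)
            if j > n:
                continue
            cost = best[i] + mismatch(line_cols[i:j], tpl)
            if cost < best[j]:
                best[j] = cost
                back[j] = (i, ch)
    if back[n] is None and n > 0:
        raise ValueError("no sequence of templates covers the line")
    labels, cuts = [], []
    j = n
    while j > 0:  # walk the back-pointers from the end of the line
        i, ch = back[j]
        labels.append(ch)
        cuts.append(j)
        j = i
    return "".join(reversed(labels)), list(reversed(cuts)), best[n]


if __name__ == "__main__":
    line = render("LIT", GLYPHS)
    print(segment_and_recognize(line, GLYPHS))
```

Because the templates carry their own widths, no separate pre-segmentation step is needed in this sketch; a degraded pixel only raises the matching cost rather than shifting a cut point. A realistic system would replace the raw pixel comparison with extracted features and would need to handle scale, skew, and glyphs that overlap or stack.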
