Discriminative learning for script recognition

Document script recognition is an important preprocessing step in a multilingual optical character recognition (MOCR) system. An MOCR system requires prior knowledge of the script in order to recognize multilingual text in a single document accurately. In multilingual documents, two scripts can be mixed within a single text line, yet many existing script recognition methods cannot recognize multiple scripts mixed within one line. Moreover, these methods usually rely on script-dependent features, which limits their scope to the particular scripts they were designed for. In this paper we propose a discriminative learning approach for multi-script recognition at the connected-component level using a convolutional neural network. The convolutional neural network combines feature extraction and script recognition in one step: discriminative features are extracted and learned as convolutional kernels directly from the raw input. This eliminates the need to define discriminative features manually for particular scripts. Results show script recognition accuracy above 95% at the connected-component level on datasets of Greek-Latin and Arabic-Latin multi-script documents and on Antiqua-Fraktur documents. The proposed method can be easily adapted to different scripts.
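To make the connected-component pipeline concrete, the following is a minimal Python sketch of its first stages, under stated assumptions. The abstract does not give the network architecture, patch size, or normalization, so the function names (`connected_components`, `to_patch`, `conv2d_valid`), the 5x5 patch, and the hand-set 3x3 kernel are all hypothetical. In the proposed method the kernels are learned from data; here a single kernel is fixed by hand only to show what one convolutional feature map computes on a component patch.

```python
from collections import deque

def connected_components(img):
    """Return bounding boxes (top, left, bottom, right) of 8-connected
    foreground regions in a binary image given as a list of rows."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for r in range(h):
        for c in range(w):
            if img[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue = deque([(r, c)])
                top, left, bottom, right = r, c, r, c
                while queue:  # breadth-first flood fill of one component
                    y, x = queue.popleft()
                    top, bottom = min(top, y), max(bottom, y)
                    left, right = min(left, x), max(right, x)
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and img[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
                boxes.append((top, left, bottom, right))
    return boxes

def to_patch(img, box, size):
    """Copy the pixels inside a bounding box onto the centre of a
    size x size zero canvas. Assumption: the box fits without scaling."""
    top, left, bottom, right = box
    ch, cw = bottom - top + 1, right - left + 1
    oy, ox = (size - ch) // 2, (size - cw) // 2
    patch = [[0.0] * size for _ in range(size)]
    for y in range(ch):
        for x in range(cw):
            patch[oy + y][ox + x] = float(img[top + y][left + x])
    return patch

def conv2d_valid(patch, kernel):
    """'Valid' 2D cross-correlation, the core operation of a CNN layer."""
    k = len(kernel)
    n = len(patch) - k + 1
    return [[sum(patch[i + a][j + b] * kernel[a][b]
                 for a in range(k) for b in range(k))
             for j in range(n)]
            for i in range(n)]

# Toy binary "page" with three components (1 = ink).
page = [
    [1, 1, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0, 1],
]
boxes = connected_components(page)

# Hand-set vertical-stroke kernel (hypothetical; the paper learns kernels).
vertical = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
patch = to_patch(page, boxes[1], 5)      # the three-pixel vertical stroke
fmap = conv2d_valid(patch, vertical)     # 3x3 feature map
```

In a full system of the kind the abstract describes, several such feature maps with learned kernels would feed further layers that output a script label per component, so that two scripts mixed in one text line can be told apart component by component.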
