Modified Gabor Feature Extraction Method for Word Level Script Identification- Experimentation with Gurumukhi and English Scripts

Script identification is one of the challenging steps in an Optical Character Recognition (OCR) system for multi-script documents. Some results have been reported for both Indian and non-Indian scripts, but research in this field is still emerging. This paper presents work on identifying Gurmukhi and English scripts at the word level; the method also distinguishes English numerals from Gurmukhi text. Gabor feature extraction is one of the most popular methods for script recognition, and this paper presents a zone-based Gabor feature extraction technique. After normalization, the word image is divided into zones of different sizes, and features are extracted from each zone in several directions using Gabor filters. The script is then determined using an SVM classifier. Experiments on Gurmukhi and English script recognition show that the proposed technique improves on traditional Gabor feature extraction without zoning. In future, the approach can be extended to other scripts.
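The zone-based extraction described above can be illustrated with a minimal sketch. The abstract does not give the zone layout, filter parameters, or feature statistic, so everything specific here is an assumption made for illustration: the grids `(1, 1)`, `(2, 2)` and `(2, 4)` standing in for "zones of different sizes", four orientations, the kernel settings, the mean absolute response as the per-zone feature, and the function names `gabor_kernel`, `filter_image` and `zone_gabor_features`. It is not the authors' implementation.

```python
import numpy as np


def gabor_kernel(theta, wavelength=8.0, sigma=4.0, size=15):
    """Real (cosine) Gabor kernel at orientation theta; parameters are illustrative."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # Rotate coordinates so the sinusoidal carrier varies along direction theta.
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + yr ** 2) / (2.0 * sigma ** 2))
    carrier = np.cos(2.0 * np.pi * xr / wavelength)
    kernel = envelope * carrier
    return kernel - kernel.mean()  # zero mean: flat regions give no response


def filter_image(image, kernel):
    """Circular convolution via the FFT; the image must be at least kernel-sized."""
    kh, kw = kernel.shape
    padded = np.zeros(image.shape, dtype=float)
    padded[:kh, :kw] = kernel
    # Shift the kernel centre to the origin so the output is not translated.
    padded = np.roll(padded, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    return np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(padded)))


def zone_gabor_features(word_image, grids=((1, 1), (2, 2), (2, 4)), n_orient=4):
    """Mean absolute Gabor response per zone, per orientation.

    word_image is assumed to be already size-normalized (e.g. 32 x 64).
    The grids are a hypothetical choice of zone sizes.
    """
    image = np.asarray(word_image, dtype=float)
    features = []
    for k in range(n_orient):
        theta = k * np.pi / n_orient
        response = np.abs(filter_image(image, gabor_kernel(theta)))
        for rows, cols in grids:
            for band in np.array_split(response, rows, axis=0):
                for zone in np.array_split(band, cols, axis=1):
                    features.append(zone.mean())
    return np.array(features)


# Synthetic stand-in for a normalized word image: vertical strokes every 8 pixels.
img = np.zeros((32, 64))
img[:, ::8] = 1.0
feats = zone_gabor_features(img)
print(feats.shape)
```

With these assumed settings each word yields 4 orientations × 13 zones = 52 features. Such vectors could then be passed to an SVM classifier, for example `sklearn.svm.SVC`, which is omitted here to keep the sketch dependency-free.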
