Script Identification Using Gabor Feature and SVM Classifier

Abstract: Script identification is a challenging task in bilingual and multilingual optical character recognition systems. A considerable body of research on script identification has been reported in both Indian and non-Indian contexts. Many commercial and official regional documents of the different states of India are bilingual, containing the regional language of the respective state along with English, and English words are frequently interspersed in such documents. Script identification is therefore one of the primary tasks in multi-script document recognition. This paper presents script identification of Gujarati and English at the word level. For feature extraction, the directional energy distribution of a word is computed using Gabor filters with suitable frequencies and orientations. The proposed system uses an SVM classifier to assign the extracted features to one of the two scripts. The results obtained are quite encouraging.
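
The feature extraction step described in the abstract can be illustrated with a minimal, dependency-free sketch. The abstract does not state the filter parameters, so the two frequencies (0.125 and 0.25 cycles per pixel), the four orientations, the Gaussian width `sigma`, the 9x9 kernel size, and the function names below are all assumptions chosen for illustration, not the authors' settings. The sketch computes one energy value per Gabor filter over a word image and normalizes the values so they sum to one.

```python
import math

def gabor_kernel(freq, theta, sigma=2.0, half=4):
    """Real and imaginary parts of a (2*half+1) x (2*half+1) Gabor kernel.

    freq is in cycles per pixel, theta is the carrier direction in radians.
    sigma and half are illustrative defaults (assumptions).
    """
    real, imag = [], []
    for y in range(-half, half + 1):
        r_row, i_row = [], []
        for x in range(-half, half + 1):
            # Coordinate along the direction of the sinusoidal carrier.
            xr = x * math.cos(theta) + y * math.sin(theta)
            envelope = math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
            r_row.append(envelope * math.cos(2.0 * math.pi * freq * xr))
            i_row.append(envelope * math.sin(2.0 * math.pi * freq * xr))
        real.append(r_row)
        imag.append(i_row)
    return real, imag

def filter_energy(image, kernel):
    """Sum of squared magnitude responses over all valid window positions."""
    real, imag = kernel
    k = len(real)
    h, w = len(image), len(image[0])
    energy = 0.0
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            re = im = 0.0
            # Correlate the window with both kernel parts.
            for u in range(k):
                for v in range(k):
                    p = image[i + u][j + v]
                    re += p * real[u][v]
                    im += p * imag[u][v]
            energy += re * re + im * im
    return energy

def gabor_features(image, freqs=(0.125, 0.25), n_orient=4):
    """Normalized directional energies, ordered by frequency, then orientation."""
    feats = []
    for f in freqs:
        for k in range(n_orient):
            theta = k * math.pi / n_orient
            feats.append(filter_energy(image, gabor_kernel(f, theta)))
    total = sum(feats) or 1.0  # avoid division by zero on a blank image
    return [e / total for e in feats]

if __name__ == "__main__":
    # Synthetic 16x16 "word image": vertical stripes with a 4-pixel period.
    stripes = [[math.cos(2 * math.pi * 0.25 * x) for x in range(16)]
               for _ in range(16)]
    print(gabor_features(stripes))
```

In a full pipeline, one such vector would be computed per segmented word and the labeled vectors passed to an SVM; scikit-learn's `sklearn.svm.SVC` is one possible choice, though the abstract does not say which implementation or kernel was used.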
