Character Level Separation and Identification of English and Gujarati Digits from Bilingual (English-Gujarati) Printed Documents

It is observed that English script is often interspersed within text in Indian languages. There is therefore a need for an optical character recognition (OCR) system that can recognize such bilingual documents and store them for future use. This paper proposes an OCR system that reads documents containing Gujarati and English scripts (digits only). The two scripts share many features, so a single system can be modelled to recognize both. We use a template matching classifier, with a normalized feature vector as the feature for classifying English and Gujarati digits. The system performs well on multi-font, size-independent printed bilingual English-Gujarati digits: at character level, it achieves an average classification rate of 98.30% for Gujarati digits and 98.88% for English digits.
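
The abstract does not specify how the feature vector is built or how templates are compared, so the following is only a minimal sketch of template matching on a normalized feature vector under stated assumptions: each segmented character is a binary bitmap, size independence is obtained by cropping to the bounding box and resampling to a fixed grid, the flattened grid is scaled to unit length, and similarity is the dot product (cosine similarity). The function names, the 8 x 8 grid, and the toy shapes are hypothetical and are not taken from the paper.

```python
import math

def crop_to_bounding_box(bitmap):
    """Trim empty rows and columns around the ink pixels (value 1)."""
    rows = [r for r, row in enumerate(bitmap) if any(row)]
    cols = [c for c in range(len(bitmap[0])) if any(row[c] for row in bitmap)]
    if not rows:
        raise ValueError("glyph contains no ink pixels")
    return [row[cols[0]:cols[-1] + 1] for row in bitmap[rows[0]:rows[-1] + 1]]

def feature_vector(bitmap, size=8):
    """Crop, resample to size x size (nearest neighbour), flatten, unit-normalize."""
    glyph = crop_to_bounding_box(bitmap)
    h, w = len(glyph), len(glyph[0])
    flat = [float(glyph[r * h // size][c * w // size])
            for r in range(size) for c in range(size)]
    # Fall back to 1.0 if sampling happened to miss every ink pixel.
    norm = math.sqrt(sum(v * v for v in flat)) or 1.0
    return [v / norm for v in flat]

def classify(bitmap, templates, size=8):
    """Return the (script, digit) label of the most similar stored template."""
    vec = feature_vector(bitmap, size)

    def similarity(item):
        # Dot product of unit vectors, i.e. cosine similarity.
        return sum(a * b for a, b in zip(vec, item[1]))

    label, _ = max(templates, key=similarity)
    return label

# Toy stand-in shapes (assumption): NOT real English or Gujarati digit glyphs.
BAR = [[0, 1, 0],
       [0, 1, 0],
       [0, 1, 0]]
RING = [[1, 1, 1],
        [1, 0, 1],
        [1, 1, 1]]

# Each template label carries both the script and the digit, so a single
# match yields script identification and digit recognition together.
TEMPLATES = [(("en", 1), feature_vector(BAR)),
             (("gu", 0), feature_vector(RING))]

# The same shapes at twice the size, to illustrate size independence.
BIG_BAR = [[0, 0, 1, 1, 0, 0] for _ in range(4)]
BIG_RING = [[1, 1, 1, 1, 1, 1],
            [1, 1, 1, 1, 1, 1],
            [1, 1, 0, 0, 1, 1],
            [1, 1, 0, 0, 1, 1],
            [1, 1, 1, 1, 1, 1],
            [1, 1, 1, 1, 1, 1]]

print(classify(BIG_BAR, TEMPLATES), classify(BIG_RING, TEMPLATES))
```

In a real system the templates would be built from segmented samples of the ten digits in each script, typically several per font, and the choice of grid size and similarity measure would need to be validated against the reported classification rates.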
