Text line script identification for a tri-lingual document

India is a multilingual, multi-script country whose states follow a three-language formula: a document may be printed in English, Hindi and the official language of the state. In Karnataka, for example, a document may contain text lines in Kannada, English and Hindi. For Optical Character Recognition (OCR) of such a multilingual document, the script of each text line must be identified before the line is passed to the OCR of the corresponding script. This paper presents a simple and efficient technique for identifying the script of Kannada, Hindi and English text lines in a printed document. The proposed system distinguishes the three scripts using features extracted from the horizontal projection profile of each text line. The knowledge base of the system is built from 15 document images containing about 450 text lines. For a new text line, the necessary features are extracted from its horizontal projection profile and compared with the stored knowledge base to classify the script. The system is tested on 20 document images containing about 200 text lines of each script, and an overall classification rate of 99.83% is achieved.
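
The pipeline described above (projection profile, feature extraction, comparison with a knowledge base) can be illustrated with a minimal Python sketch. The abstract does not state which features are taken from the profile or how the comparison is made, so the two features used here (normalized peak row and peak density), the nearest-match rule, and all knowledge-base values are assumptions chosen only for illustration, not the method of the paper.

```python
def horizontal_projection_profile(line_image):
    """Count ink pixels (value 1) in each row of a binarized text-line image."""
    return [sum(row) for row in line_image]


def profile_features(profile, width):
    """Illustrative features (assumed, not taken from the paper):
    vertical position of the densest row, scaled to [0, 1], and the
    fraction of that row covered by ink."""
    peak_value = max(profile)
    peak_row = profile.index(peak_value)
    height = len(profile)
    peak_position = peak_row / (height - 1) if height > 1 else 0.0
    return (peak_position, peak_value / width)


def classify(features, knowledge_base):
    """Return the script whose stored feature vector is nearest to
    `features` under squared Euclidean distance (an assumed rule)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(knowledge_base,
               key=lambda script: squared_distance(features, knowledge_base[script]))


# Toy 5x8 binarized text line with one dense row near the top,
# loosely imitating the headline of Devanagari text.
toy_line = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]

profile = horizontal_projection_profile(toy_line)
features = profile_features(profile, width=len(toy_line[0]))

# Hypothetical knowledge base: placeholder values, not measured data.
knowledge_base = {
    "Hindi": (0.25, 0.9),
    "English": (0.5, 0.6),
    "Kannada": (0.6, 0.7),
}
label = classify(features, knowledge_base)
```

In practice the knowledge base would be derived from training text lines, such as the 450 lines mentioned above, rather than written by hand, and the feature set would need to be rich enough to separate all three scripts.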
