Word-wise Sinhala Tamil and English script identification using Gaussian kernel SVM

Many documents in Sri Lanka contain Sinhala, Tamil and English text on a single page. To develop OCR for such a page, it is better to first identify the scripts present and then feed each identified portion to the corresponding OCR module. In this paper, an SVM-based technique is proposed for word-wise identification of Sinhala, Tamil and English scripts from a single document page. The classifier relies mainly on structural features, topological features and features based on the water reservoir principle. The experiments gave encouraging results.
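As a rough illustration of the classifier named in the title, the sketch below evaluates a Gaussian (RBF) kernel and the resulting binary SVM decision score. It is a minimal sketch, not the paper's implementation: the function names, the `sigma` parameter, the two-dimensional feature vectors, the support vectors and the coefficients are all hypothetical toy values, and the abstract does not say how the three-script problem is decomposed into binary decisions (one-vs-rest and one-vs-one are the usual options).

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel K(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def svm_decision(x, support_vectors, coeffs, bias, sigma=1.0):
    """Signed SVM score: sum_i coeff_i * K(sv_i, x) + bias.

    Each coeff_i stands for alpha_i * y_i, i.e. the learned multiplier
    times the class label (+1 or -1) of support vector i.
    """
    score = sum(
        c * gaussian_kernel(sv, x, sigma)
        for sv, c in zip(support_vectors, coeffs)
    )
    return score + bias


def classify(x, support_vectors, coeffs, bias, sigma=1.0):
    """Return +1 or -1 according to the sign of the decision score."""
    return 1 if svm_decision(x, support_vectors, coeffs, bias, sigma) >= 0 else -1


# Hypothetical toy model (not taken from the paper): one support vector
# per class in a made-up two-dimensional feature space.
SUPPORT_VECTORS = [(0.0, 0.0), (1.0, 1.0)]
COEFFS = [1.0, -1.0]
BIAS = 0.0

if __name__ == "__main__":
    for word_features in [(0.1, 0.2), (0.9, 0.8)]:
        label = classify(word_features, SUPPORT_VECTORS, COEFFS, BIAS)
        print(word_features, label)
```

In a real system, each word image would first be converted into a feature vector (here, the structural, topological and water reservoir features), and the support vectors, coefficients and bias would come from training rather than being set by hand.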
