Script Identification – A Han and Roman Script Perspective

All Han-based scripts (Chinese, Japanese, and Korean) possess similar visual characteristics, so identifying Chinese, Japanese, and Korean scripts within a single document page is challenging. A Han-based document page may also contain Roman script. A multi-script OCR system that handles Chinese, Japanese, Korean, and Roman scripts must therefore identify the script before invoking the corresponding OCR module. We propose a system that addresses this problem using directional features together with a Gaussian-kernel Support Vector Machine. The system achieved a script identification accuracy of 98.39% at the character level and 99.85% at the block level when no rejection was considered.
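
As a rough illustration of the two ingredients named above, the following minimal sketch computes a histogram of gradient directions for a small grayscale image and evaluates a Gaussian kernel between two feature vectors. The abstract does not specify the exact directional features or kernel parameters, so the function names `direction_histogram` and `gaussian_kernel`, the 8-bin quantization, the central-difference gradient, and the `gamma` value are all assumptions made for illustration, not the authors' method.

```python
import math

def direction_histogram(image, bins=8):
    """Magnitude-weighted histogram of gradient directions (hypothetical feature).

    `image` is a list of rows of numeric pixel values. Gradients use central
    differences on interior pixels; the result is normalized to sum to 1.
    """
    hist = [0.0] * bins
    rows, cols = len(image), len(image[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = image[r][c + 1] - image[r][c - 1]
            gy = image[r + 1][c] - image[r - 1][c]
            mag = math.hypot(gx, gy)
            if mag == 0:
                continue  # flat region, no direction to record
            angle = math.atan2(gy, gx) % (2 * math.pi)
            idx = int(angle / (2 * math.pi) * bins) % bins
            hist[idx] += mag
    total = sum(hist)
    return [h / total for h in hist] if total else hist

def gaussian_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Toy images: one with a vertical edge, one with a horizontal edge.
vertical_edge = [[0, 0, 1, 1, 1] for _ in range(5)]
horizontal_edge = [[0] * 5, [0] * 5, [1] * 5, [1] * 5, [1] * 5]

f_vertical = direction_histogram(vertical_edge)
f_horizontal = direction_histogram(horizontal_edge)
similarity = gaussian_kernel(f_vertical, f_horizontal, gamma=0.5)
```

In a full system, feature vectors of this kind would be passed to an SVM trained with the Gaussian kernel; the kernel value is 1 for identical vectors and decays toward 0 as the vectors move apart.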
