Language identification of character images using machine learning techniques

In this paper, we propose a new approach for identifying the language of character images. We classify individual character images in order to locate the language boundaries in multilingual documents. Two methods are considered for this purpose: the prototype classification method and support vector machines (SVM). Because our training data set is large, we also propose a technique to speed up the training process for both methods. Applied to classifying characters as Chinese, English, or Japanese (including Hiragana and Katakana), the two methods produce highly accurate test results that are comparable to each other.
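The idea of labeling each character and then reading off language boundaries can be illustrated with a minimal sketch. It is not the authors' implementation: the 2-D feature vectors, the three hand-picked prototypes, and the function names `nearest_prototype` and `language_segments` are all hypothetical, and a plain nearest-prototype rule with Euclidean distance stands in for the prototype classification method and the SVM described above.

```python
from itertools import groupby
from math import dist

def nearest_prototype(feature, prototypes):
    """Return the label of the prototype closest to `feature` (Euclidean distance)."""
    return min(prototypes, key=lambda p: dist(feature, p[0]))[1]

def language_segments(labels):
    """Collapse per-character labels into (label, start, end) runs; `end` is exclusive."""
    segments = []
    pos = 0
    for label, run in groupby(labels):
        n = len(list(run))
        segments.append((label, pos, pos + n))
        pos += n
    return segments

# Hypothetical prototypes: one (feature vector, language) pair per class.
PROTOTYPES = [
    ((0.0, 0.0), "English"),
    ((1.0, 1.0), "Chinese"),
    ((0.0, 1.0), "Japanese"),
]

# Hypothetical features for five consecutive character images in one text line.
features = [(0.1, 0.0), (0.0, 0.2), (0.9, 1.1), (1.0, 0.8), (0.1, 0.9)]

labels = [nearest_prototype(f, PROTOTYPES) for f in features]
segments = language_segments(labels)
print(segments)
```

Each boundary between consecutive segments marks a position where the predicted language changes. A real system would extract features from the character images and would typically use many prototypes per class rather than one.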
