Word-Wise Handwritten Persian and Roman Script Identification

Most countries use bi-script documents, because each country uses its own national language alongside English as a second or foreign language. Bilingual documents in which one language is English and the other is the national language are therefore very common; postal documents are a good example of such bilingual, bi-script documents. This paper deals with word-wise handwritten script identification from bi-script documents written in Persian and Roman. The proposed scheme uses a simple, fast-to-compute set of 12 features based on fractal dimension, position of small components, topology, etc., and employs a set of classifiers for the script identification experiments. We tested our scheme on a dataset of 5000 handwritten Persian and English words and obtained a correct script identification rate of 99.20%.
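To make the fractal-dimension feature concrete, the following is a minimal sketch of a box-counting estimate on a binary word image. The abstract does not say how the fractal dimension is computed, so box counting is an assumption here, and the names `box_count` and `fractal_dimension`, the list-of-lists image format, and the box sizes are all hypothetical.

```python
import math


def box_count(image, size):
    """Count the size x size boxes that contain at least one foreground pixel.

    `image` is assumed to be a list of rows of 0/1 values (1 = ink).
    """
    rows, cols = len(image), len(image[0])
    count = 0
    for r in range(0, rows, size):
        for c in range(0, cols, size):
            # A box is occupied if any pixel inside it is foreground.
            if any(
                image[i][j]
                for i in range(r, min(r + size, rows))
                for j in range(c, min(c + size, cols))
            ):
                count += 1
    return count


def fractal_dimension(image, sizes=(1, 2, 4, 8)):
    """Estimate the box-counting dimension of a binary image.

    The estimate is the least-squares slope of log N(s) against log(1/s),
    where N(s) is the number of occupied boxes of side s.
    """
    counts = [box_count(image, s) for s in sizes]
    if min(counts) == 0:
        raise ValueError("image contains no foreground pixels")
    xs = [math.log(1.0 / s) for s in sizes]
    ys = [math.log(n) for n in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


if __name__ == "__main__":
    # Toy 16 x 16 inputs: a single horizontal stroke and a filled block.
    stroke = [[1 if r == 0 else 0 for _ in range(16)] for r in range(16)]
    block = [[1] * 16 for _ in range(16)]
    print(fractal_dimension(stroke), fractal_dimension(block))
```

A straight stroke gives an estimate near 1 and a filled region near 2; handwritten words fall in between, which is the kind of scalar a feature vector like the one described could include alongside the positional and topological features.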
