Script Identification from Printed Indian Document Images and Performance Evaluation Using Different Classifiers

Script identification from document images is an active research area within document image processing, particularly for a multilingual, multiscript country such as India. This paper considers the real-life problem of identifying printed scripts in official Indian document images and evaluates the performance of several well-known classifiers. Two evaluation parameters are computed for this analysis: AAR (average accuracy rate) and MBT (model building time). Experiments were carried out on 459 printed document images with 5-fold cross-validation. The Simple Logistic model shows the highest AAR, 98.9%. The BayesNet and Random Forest models achieve AARs of 96.7% and 98.2%, respectively, with the lowest MBT of 0.09 s.
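The evaluation protocol described above (k-fold cross-validation reporting an accuracy figure and a model building time) can be sketched as follows. This is a minimal illustration, not the authors' pipeline. It assumes that AAR is the mean per-fold accuracy and that MBT is the mean time spent fitting a model on each training split; the abstract does not define either precisely. The `NearestCentroid` classifier, the `make_synthetic_data` helper, and the two-cluster data are hypothetical stand-ins for the classifiers and script features used in the paper.

```python
import random
import time
from statistics import mean


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


class NearestCentroid:
    """Toy stand-in classifier (hypothetical; not one evaluated in the paper)."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for features, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(features))
            for j, value in enumerate(features):
                acc[j] += value
            counts[label] = counts.get(label, 0) + 1
        # One centroid (mean feature vector) per class label.
        self.centroids = {
            label: [s / counts[label] for s in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, X):
        def closest(features):
            return min(
                self.centroids,
                key=lambda label: sum(
                    (a - b) ** 2 for a, b in zip(features, self.centroids[label])
                ),
            )

        return [closest(features) for features in X]


def evaluate(make_model, X, y, k=5, seed=0):
    """Return (AAR in percent, MBT in seconds), both averaged over k folds.

    Assumption: AAR = mean fold accuracy, MBT = mean time of fit() per fold.
    """
    accuracies, build_times = [], []
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = make_model()
        start = time.perf_counter()
        model.fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        build_times.append(time.perf_counter() - start)
        predictions = model.predict([X[i] for i in test_idx])
        correct = sum(p == y[i] for p, i in zip(predictions, test_idx))
        accuracies.append(100.0 * correct / len(test_idx))
    return mean(accuracies), mean(build_times)


def make_synthetic_data(n_samples=459, seed=0):
    """Two well-separated Gaussian clusters; synthetic, not the paper's data."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        centre = 10.0 * label
        X.append([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)])
        y.append(label)
    return X, y


X, y = make_synthetic_data()
aar, mbt = evaluate(NearestCentroid, X, y, k=5)
print(f"AAR = {aar:.1f}%, MBT = {mbt:.4f} s")
```

Timing only the `fit` call keeps prediction cost out of the MBT figure, so a classifier that trains quickly but predicts slowly would still report a low MBT under this definition.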
