A comprehensive handwritten Indic script recognition system: a tree-based approach

Considerable progress has been made in developing optical character recognition (OCR) systems for handwritten document images in individual Indic scripts. In a multi-script country like India, however, a script-specific OCR system cannot by itself digitize documents that mix several scripts, and developing a script-invariant OCR engine is almost impossible. In any multi-script environment, a script identification module is therefore essential before the actual digitization through an OCR engine can begin. With this need in mind, we propose a novel handwritten script recognition model that covers all 12 officially recognized scripts in India. Classification is performed at word level using a tree-based approach: Matra-based scripts are first separated from non-Matra scripts using the distance-Hough transform (DHT) algorithm, and the scripts within each group are then identified individually using modified log-Gabor filter based features computed at multiple scales and orientations. Encouraging results establish the efficacy of the tree-based approach for classifying handwritten Indic scripts.
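
The two-level structure described above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a standard (unmodified) log-Gabor filter bank built in the frequency domain, and every name and parameter value (`log_gabor_bank`, `log_gabor_features`, `classify_word`, `min_wavelength`, `sigma_on_f`, and so on) is a hypothetical choice made for illustration. The Matra detector (the DHT step in the paper) and the two per-branch classifiers are assumed to be supplied by the caller as plain callables.

```python
import numpy as np


def log_gabor_bank(shape, n_scales=3, n_orients=4, min_wavelength=4.0,
                   mult=2.0, sigma_on_f=0.55, sigma_theta=np.pi / 8):
    """Build standard log-Gabor filters in the frequency domain.

    All parameter values are illustrative assumptions, not the paper's.
    """
    rows, cols = shape
    fy = np.fft.fftfreq(rows)[:, None]   # vertical frequencies, shape (rows, 1)
    fx = np.fft.fftfreq(cols)[None, :]   # horizontal frequencies, shape (1, cols)
    radius = np.sqrt(fx ** 2 + fy ** 2)  # broadcasts to (rows, cols)
    radius[0, 0] = 1.0                   # avoid log(0) at the DC component
    theta = np.arctan2(fy, fx)

    filters = []
    for s in range(n_scales):
        f0 = 1.0 / (min_wavelength * mult ** s)  # centre frequency of this scale
        radial = np.exp(-(np.log(radius / f0) ** 2)
                        / (2 * np.log(sigma_on_f) ** 2))
        radial[0, 0] = 0.0                       # log-Gabor has no DC response
        for o in range(n_orients):
            angle = o * np.pi / n_orients
            # Angular distance to the filter orientation, wrapped to [0, pi].
            d_theta = np.abs(np.arctan2(np.sin(theta - angle),
                                        np.cos(theta - angle)))
            angular = np.exp(-(d_theta ** 2) / (2 * sigma_theta ** 2))
            filters.append(radial * angular)
    return filters


def log_gabor_features(word_image, **bank_kwargs):
    """Mean and standard deviation of each filter's response magnitude."""
    img = np.asarray(word_image, dtype=float)
    spectrum = np.fft.fft2(img)
    feats = []
    for filt in log_gabor_bank(img.shape, **bank_kwargs):
        response = np.abs(np.fft.ifft2(spectrum * filt))
        feats.extend([response.mean(), response.std()])
    return np.array(feats)


def classify_word(word_image, has_matra, matra_clf, non_matra_clf):
    """Route a word image down one of the two branches of the tree.

    has_matra, matra_clf and non_matra_clf are caller-supplied callables
    (hypothetical stand-ins for the DHT test and the trained classifiers).
    """
    feats = log_gabor_features(word_image)
    branch = matra_clf if has_matra(word_image) else non_matra_clf
    return branch(feats)
```

With the default settings the feature vector has 3 scales × 4 orientations × 2 statistics = 24 entries. Splitting the decision into a root test and two branch classifiers means each classifier only has to discriminate among the scripts in its own group.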
