Page-level handwritten script identification using modified log-Gabor filter based features

Automatic script identification, an important research problem over the last few decades, poses many challenges in any multi-script environment. Because India is a multilingual country, text documents containing more than one language are commonplace. Before such multilingual documents can be digitized with an Optical Character Recognition (OCR) engine, the scripts in which they are written must first be identified. This paper proposes a page-level script identification technique for eight popular handwritten scripts: Bangla, Devanagari, Gurumukhi, Oriya, Tamil, Telugu, Urdu, and Roman. First, texture features based on modified log-Gabor filters are extracted from each document page. The proposed model is then evaluated with multiple classifiers, and a comparison of their identification accuracies shows that Simple Logistic performs best. The results demonstrate the usefulness of modified log-Gabor filter based features for recognizing handwritten Indic scripts. The experiment uses a total of 240 document pages and achieves 95.57% accuracy in identifying the scripts of the documents. Although the method is assessed on a limited dataset, the outcome can be considered reasonably acceptable given the intricacies of the scripts.
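
As a rough illustration of the feature-extraction step, the sketch below builds a standard log-Gabor filter bank in the frequency domain and summarizes each filter response by its mean and standard deviation. It is not the modified filter proposed in the paper, whose details the abstract does not give. The function names, the number of scales and orientations, the bandwidth parameters, and the choice of mean and standard deviation as features are all assumptions made for illustration.

```python
import numpy as np

def log_gabor_bank(shape, n_scales=3, n_orients=4, min_wavelength=3.0,
                   mult=2.0, sigma_on_f=0.55, sigma_theta=np.pi / 8):
    """Return a list of frequency-domain log-Gabor filters (hypothetical defaults)."""
    rows, cols = shape
    # Frequency grid in cycles/pixel; zero frequency sits at index [0, 0].
    fy = np.fft.fftfreq(rows)[:, None]
    fx = np.fft.fftfreq(cols)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    radius[0, 0] = 1.0  # placeholder to avoid log(0); DC is zeroed below
    theta = np.arctan2(-fy, fx)

    filters = []
    for s in range(n_scales):
        f0 = 1.0 / (min_wavelength * mult ** s)  # centre frequency of this scale
        # Radial part: Gaussian on a logarithmic frequency axis.
        radial = np.exp(-(np.log(radius / f0)) ** 2
                        / (2 * np.log(sigma_on_f) ** 2))
        radial[0, 0] = 0.0  # a log-Gabor filter has no DC component
        for o in range(n_orients):
            angle = o * np.pi / n_orients
            # Angular distance to the filter orientation, wrapped to [-pi, pi].
            d = np.arctan2(np.sin(theta - angle), np.cos(theta - angle))
            angular = np.exp(-d ** 2 / (2 * sigma_theta ** 2))
            filters.append(radial * angular)
    return filters

def log_gabor_features(image, **kwargs):
    """Mean and standard deviation of each filter's response magnitude."""
    spectrum = np.fft.fft2(image)
    feats = []
    for filt in log_gabor_bank(image.shape, **kwargs):
        response = np.abs(np.fft.ifft2(spectrum * filt))
        feats.extend([response.mean(), response.std()])
    return np.asarray(feats)
```

With the default three scales and four orientations, a grayscale page image passed to `log_gabor_features` yields a 24-dimensional vector, which could then be fed to any of the classifiers being compared.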
