Devanagari ancient documents recognition using statistical feature extraction techniques

Recognition of ancient Devanagari documents is attracting considerable attention from researchers. These documents contain a wealth of knowledge, but their fragile condition keeps them out of reach of most readers. A recognition system for ancient Devanagari manuscripts has therefore been designed for digital archiving. The system comprises image binarization, character segmentation and recognition phases, and it automatically recognizes scanned and segmented characters. Segmented characters may include basic characters (vowels and consonants), modifiers (matras) and compound characters (characters formed by joining more than one basic character). This paper presents a recognition system for handwritten ancient Devanagari manuscripts based on statistical feature extraction techniques. In the feature extraction phase, intersection points, open endpoints, centroid, horizontal peak extent and vertical peak extent features are extracted. For classification, Convolutional Neural Network, Neural Network, Multilayer Perceptron, RBF-SVM and random forest techniques are considered. These feature extraction and classification techniques are compared on the recognition of basic characters segmented from ancient Devanagari manuscripts. A data set of 6152 pre-segmented samples from ancient Devanagari documents is used for the experimental work. The authors achieved a recognition accuracy of 88.95% by using all features together and combining all classifiers considered in this work through a simple majority voting scheme.
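The simple majority voting scheme mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract does not describe the implementation, so everything here is an assumption made for illustration: the function names `majority_vote` and `ensemble_predict`, the placeholder class labels, and in particular the tie-breaking rule (the label proposed by the earliest classifier in the list wins).

```python
from collections import Counter


def majority_vote(votes):
    """Return the label predicted by the most classifiers for one sample.

    `votes` holds one predicted label per classifier.
    Tie-breaking is an assumption of this sketch: among equally
    frequent labels, the one voted by the earliest classifier wins.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes)
    top = max(counts.values())
    # Walk the votes in classifier order so ties resolve deterministically.
    for label in votes:
        if counts[label] == top:
            return label


def ensemble_predict(per_classifier_preds):
    """Combine predictions from several classifiers, sample by sample.

    `per_classifier_preds[i][j]` is the label that classifier i
    assigned to sample j; all inner lists must have equal length.
    """
    return [majority_vote(list(votes)) for votes in zip(*per_classifier_preds)]


if __name__ == "__main__":
    # Hypothetical outputs of five classifiers on three character samples.
    preds = [
        ["ka", "kha", "ga"],
        ["ka", "ga", "ga"],
        ["kha", "kha", "ta"],
        ["ka", "kha", "ga"],
        ["ka", "ga", "pa"],
    ]
    print(ensemble_predict(preds))
```

With five classifiers, as in the abstract, two-way ties are impossible, but three or more competing labels can still tie (for example 2-2-1), so some tie-breaking rule is needed; the one above is only one possible choice.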
