Frequently Used Devanagari Words in Marathi and Pali Language Documents

Optical character recognition (OCR) deals with the recognition of printed or handwritten characters. India is a multilingual country, and much of its historical record is written in languages that have been in use since ancient times, so important information remains to be discovered in the surviving documents. In this paper, we devise a method that identifies the language of the script under observation. Character recognition is itself a challenging problem because of variation in the font and size of the characters. We develop a scheme for complete OCR of Marathi and Pali documents. The proposed system segments the lines and words of Marathi and Pali documents and is evaluated on ten Marathi and ten Pali documents comprising 552 text lines and 6430 words. Line segmentation accuracy is 99.25% for Marathi and 98.6% for Pali; word segmentation accuracy is 97.6% and 96.5%, respectively. Using a K-NN classifier, the most frequently used words in the Marathi and Pali documents are identified.
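
The abstract does not say how lines and words are segmented. One common approach for printed Devanagari is projection-profile analysis, and the sketch below illustrates that idea only; it is not necessarily the method used in the paper. It assumes the page has already been binarized (ink = 1, background = 0), and the function names `runs`, `segment_lines`, and `segment_words` and the `min_gap` parameter are hypothetical.

```python
import numpy as np

def runs(profile, min_gap=1):
    """Return (start, end) spans where profile > 0.

    Spans separated by fewer than min_gap empty positions are merged.
    """
    spans, start = [], None
    for i, has_ink in enumerate(profile > 0):
        if has_ink and start is None:
            start = i                      # a run of ink begins
        elif not has_ink and start is not None:
            spans.append((start, i))       # the run ended at i (exclusive)
            start = None
    if start is not None:
        spans.append((start, len(profile)))

    merged = []
    for s, e in spans:
        if merged and s - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], e)  # gap too small: same unit
        else:
            merged.append((s, e))
    return merged

def segment_lines(binary):
    """Text lines are runs of rows that contain ink (horizontal projection)."""
    return runs(binary.sum(axis=1))

def segment_words(binary, line, min_gap=2):
    """Words are runs of ink columns inside one line (vertical projection)."""
    top, bottom = line
    return runs(binary[top:bottom].sum(axis=0), min_gap)

# Tiny synthetic page: two text lines, the first containing two words.
page = np.zeros((10, 12), dtype=np.uint8)
page[1:3, 0:4] = 1
page[1:3, 6:9] = 1
page[5:8, 2:5] = 1
page[5:8, 6:10] = 1   # only one empty column before it, so it joins the previous run

for line in segment_lines(page):
    print(line, segment_words(page, line))
```

In Devanagari the headline (shirorekha) usually joins the characters of a word, so empty columns in the vertical projection tend to mark word boundaries rather than character boundaries. The `min_gap` threshold is an assumption that would need tuning per font and scan resolution.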
