Amalgamated Approach for Devanagari Script Corpus for OCR & Demographic Purpose and XML for Linguistic Annotation

In this paper, we present the compilation of a Hindi handwritten text image corpus and its linguistic perspective for OCR and information retrieval from handwritten documents. Entering a single character in Devanagari script is somewhat complicated: because of the use of modifiers, one character often requires a combination of several symbols. A mixed approach is proposed and demonstrated for a Hindi corpus that serves both OCR and demographic data collection. The demographic part of the database could be used to train a system to extract data automatically, which would simplify the manual data-processing tasks currently involved in data collection, such as input forms for AADHAR, driving licenses, and railway reservations. This would increase the participation of the Hindi language community in understanding and benefiting from government schemes. To make the database available and applicable across a wide area of corpus linguistics, we propose a methodology for data collection, mark-up, digital transcription, and XML metadata information for benchmarking, and we use Zipf's law to analyze the distribution and behavior of words in the corpus.
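As an illustration of the Zipf's law analysis mentioned above, the following is a minimal sketch rather than the procedure used in the paper. Zipf's law states that the frequency of a word is roughly inversely proportional to its rank, so a plot of log frequency against log rank should be close to a straight line with slope near -1. The function names `rank_frequency` and `zipf_slope` are hypothetical, and the token list is a toy example constructed for this sketch, not data from the corpus.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, frequency) triples, most frequent word first."""
    counts = Counter(tokens)
    # Sort by descending frequency; break ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ranked, start=1)]


def zipf_slope(table):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank, _, _ in table]
    ys = [math.log(freq) for _, _, freq in table]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy tokens (an assumption for illustration): frequencies 12, 6, 4, 3
# follow 12 / rank, so this sample is perfectly Zipfian by construction.
tokens = ["है"] * 12 + ["का"] * 6 + ["में"] * 4 + ["और"] * 3

table = rank_frequency(tokens)
slope = zipf_slope(table)
```

On a real transcription, `tokens` would come from splitting the digital transcription into words, and a slope far from -1 would indicate that the word distribution departs from the Zipfian pattern.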
