A segmentation-free approach for printed Devanagari script recognition

Long Short-Term Memory (LSTM) networks are well suited to segmentation-free Optical Character Recognition (OCR) because they process each input in the context of its neighbors. In this paper, we report the results of applying LSTM networks to Devanagari script, where consonant-consonant conjuncts and consonant-vowel combinations take different forms depending on their position in the word. We also introduce a new, freely available database of Devanagari script, Deva-DB, to support research towards a robust Devanagari OCR system. On this database, the LSTM-based OCRopus system yields error rates ranging from 1.2% to 9.0%, depending on the complexity of the training and test data. A comparison with the open-source Tesseract system on the same database is also presented.
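The abstract does not state how the error rates are measured. A common convention in OCR evaluation is the character error rate: the edit distance between recognized text and ground truth, divided by the ground-truth length. The sketch below assumes that convention; the function names `edit_distance` and `character_error_rate` are illustrative and are not taken from OCRopus or Tesseract.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in Unicode code points."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(pairs):
    """Percentage error over (ground_truth, recognized) text-line pairs."""
    total_errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total_chars = sum(len(ref) for ref, _ in pairs)
    if total_chars == 0:
        raise ValueError("ground truth is empty")
    return 100.0 * total_errors / total_chars


# Assumed example: the recognizer drops the virama in a conjunct.
lines = [("नमस्ते", "नमसते")]
print(round(character_error_rate(lines), 2))
```

Python strings iterate over code points, so a Devanagari conjunct counts as several units (consonant, virama, consonant). An evaluation that counts whole aksharas or grapheme clusters instead would give different percentages for the same output.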