Improving Classical OCRs for Brahmic Scripts Using Script Grammar Learning

Classical OCRs based on isolated character (symbol) recognition were the fundamental way of generating textual representations, particularly for Indian scripts, until transcription-based approaches gained momentum. Although the classical approaches have been criticized as failure-prone, their accuracy has remained fairly decent in comparison with the newer transcription-based approaches. An analysis of isolated character recognition OCRs for Hindi and Bangla revealed that most errors arise in converting the output of the classifier to valid Unicode sequences, i.e., in script grammar generation. The linguistic rules for generating scripts are inadequately integrated, resulting in a rigid Unicode generation scheme that is cumbersome to understand and error-prone when adapted to new Indian scripts. In this paper we propose a machine-learning-based scheme for generating Unicode from classifier symbols, which outperforms the existing generation scheme and improves accuracy for Devanagari and Bangla scripts.
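To illustrate why converting classifier symbols to Unicode is not a one-to-one lookup, consider the Devanagari vowel sign ि (U+093F): it is drawn to the left of its consonant but stored after it in Unicode. The sketch below is a hand-written rule of the rigid kind described above, not the learned scheme proposed in the paper; the symbol labels and the function name are hypothetical and chosen only for this example.

```python
# Hypothetical classifier labels (assumed for illustration) mapped to code points.
SYMBOL_TO_UNICODE = {
    "ka": "\u0915",        # क
    "ta": "\u0924",        # त
    "ba": "\u092C",        # ब
    "matra_i": "\u093F",   # ि, drawn to the LEFT of its consonant
    "matra_aa": "\u093E",  # ा, drawn to the right of its consonant
}

# Vowel signs that appear before the consonant visually
# but must follow it in the Unicode sequence.
PRE_BASE_MATRAS = {"matra_i"}


def symbols_to_unicode(symbols):
    """Convert classifier symbols in visual order to a Unicode string."""
    out = []
    pending = None  # pre-base matra waiting for its consonant
    for sym in symbols:
        if sym in PRE_BASE_MATRAS:
            if pending is not None:
                # Two pre-base matras in a row: emit the first unchanged.
                out.append(SYMBOL_TO_UNICODE[pending])
            pending = sym
            continue
        out.append(SYMBOL_TO_UNICODE[sym])
        if pending is not None:
            # Place the held matra after the symbol it visually preceded.
            out.append(SYMBOL_TO_UNICODE[pending])
            pending = None
    if pending is not None:
        # Dangling matra at the end of the input: emit it unchanged.
        out.append(SYMBOL_TO_UNICODE[pending])
    return "".join(out)


# Visual order of the word किताब is: ि क त ा ब
word = symbols_to_unicode(["matra_i", "ka", "ta", "matra_aa", "ba"])
```

Even this single rule needs special handling for edge cases, and conjuncts, the reph, and split vowel signs in Bangla each add further rules, which suggests why a hand-maintained scheme becomes hard to carry over to new scripts.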
