HMM-based script identification for OCR

Although current OCR systems can recognize text in a growing number of scripts and languages, they typically still need to be told in advance which scripts and languages to expect. We propose an approach that repurposes the same HMM-based system used for OCR for the task of script/language identification (ID) by replacing character labels with script class labels. We apply it in a multi-pass OCR process that achieves "universal" OCR over 54 tested languages in 18 distinct scripts, across a wide variety of typefaces in each. For comparison, we also consider a brute-force approach in which a single HMM-based OCR system is trained to recognize all considered scripts. Results are presented on a large and diverse evaluation set extracted from book images, both for script identification accuracy and for overall OCR accuracy. On this evaluation data, the script ID system achieved a script ID error rate of 1.73% over 18 distinct scripts. The end-to-end OCR system using the script ID system achieved a character error rate of 4.05%, an increase of 0.77% over the case where the languages are known a priori.
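The relabeling idea and the multi-pass flow described above can be illustrated with a minimal sketch. It is not the system evaluated here: the function names (`script_of`, `to_script_labels`, `identify_script`, `multi_pass_ocr`), the use of the first word of a character's Unicode name as a coarse script label, and the majority vote over decoded labels are all assumptions made for illustration. The script ID decoder and the per-script recognizers are passed in as plain callables standing in for trained HMM systems.

```python
import unicodedata
from collections import Counter


def script_of(ch):
    """Return a coarse script label for a letter, or None for non-letters.

    Assumption: the first word of the Unicode character name (e.g. "LATIN",
    "CYRILLIC") is used as a rough proxy for the script class.
    """
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def to_script_labels(transcript):
    """Replace each character label in a transcript with its script class label.

    Non-letters (spaces, digits, punctuation) are dropped in this sketch.
    """
    labels = (script_of(ch) for ch in transcript)
    return [label for label in labels if label is not None]


def identify_script(predicted_labels):
    """Pick the most frequent script label in a decoded label sequence."""
    counts = Counter(predicted_labels)
    return counts.most_common(1)[0][0] if counts else None


def multi_pass_ocr(line_image, script_id_decoder, recognizers):
    """First pass: identify the script. Second pass: run that script's OCR.

    `script_id_decoder` and `recognizers` are hypothetical callables that
    stand in for trained models.
    """
    script = identify_script(script_id_decoder(line_image))
    return script, recognizers[script](line_image)


if __name__ == "__main__":
    # Training-side view: a character transcript becomes a script transcript.
    print(to_script_labels("Привет, мир"))

    # Decoding-side view with stub models in place of trained HMMs.
    stub_decoder = lambda image: ["CYRILLIC", "CYRILLIC", "LATIN"]
    stub_recognizers = {"CYRILLIC": lambda image: "мир",
                        "LATIN": lambda image: "mir"}
    print(multi_pass_ocr(None, stub_decoder, stub_recognizers))
```

With script labels in place of character labels, the same training and decoding machinery can be reused unchanged; only the label inventory shrinks from characters to script classes.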
