Cross-language Framework for Word Recognition and Spotting of Indic Scripts

Abstract: Handwritten word recognition and spotting for low-resource scripts are difficult because sufficient training data is not available and collecting data for such scripts is often expensive. This paper presents a novel cross-language platform for handwritten word recognition and spotting in such low-resource scripts, where training is performed on a sufficiently large dataset of an available script (the source script) and testing is done on other scripts (the target scripts). Obtaining reasonable results when training on one script and testing on another is not easy in the handwriting domain, because handwriting variability across scripts is complex. It is also difficult to map between source and target characters when they appear in cursive word images. The proposed Indic cross-language framework exploits a large dataset for training and uses it to recognize and spot text in target scripts for which sufficient training data is not available. Since Indic scripts are mostly written in three zones, namely upper, middle and lower, we employ zone-wise character (or component) mapping for efficient learning. The performance of our cross-language framework depends on the extent of similarity between the source and target scripts. Hence, we devise an entropy-based script similarity score using source-to-target character mapping, which indicates the feasibility of cross-language transcription. We have tested our approach on three Indic scripts, namely Bangla, Devanagari and Gurumukhi, and report the corresponding results.
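The abstract does not give the formula for the entropy-based similarity score, so the following is only an illustrative sketch of how such a score could be built from source-to-target character mapping counts. The function names, the count-weighted averaging, and the normalization by the target alphabet size are all assumptions made for illustration, not the paper's definition. The idea shown: a source character that maps consistently to one target character has low entropy, and low average entropy suggests the scripts are similar enough for cross-language transcription.

```python
import math

def mapping_entropy(target_counts):
    """Shannon entropy (in bits) of one source character's target distribution."""
    total = sum(target_counts.values())
    if total == 0:
        return 0.0
    return -sum(
        (c / total) * math.log2(c / total)
        for c in target_counts.values()
        if c > 0
    )

def script_similarity(mapping_counts, n_targets):
    """Hypothetical similarity score in [0, 1]; NOT the paper's formula.

    mapping_counts: {source_char: {target_char: count}}
    n_targets: size of the target character set (assumed > 1),
               used to normalize entropy to [0, 1].
    """
    grand_total = sum(sum(t.values()) for t in mapping_counts.values())
    if grand_total == 0:
        raise ValueError("no mapping observations")
    max_entropy = math.log2(n_targets)
    avg_norm_entropy = 0.0
    for target_counts in mapping_counts.values():
        # Weight each source character by how often it was observed.
        weight = sum(target_counts.values()) / grand_total
        avg_norm_entropy += weight * mapping_entropy(target_counts) / max_entropy
    return 1.0 - avg_norm_entropy

# Toy counts (made up for illustration): "a" maps cleanly, "b" is ambiguous.
toy = {"a": {"x": 10}, "b": {"y": 5, "z": 5}}
score = script_similarity(toy, n_targets=4)
```

Under this sketch, a one-to-one mapping scores 1.0, and the score falls as source characters spread over more target characters.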
