Density based semi-automatic labeling on multi-feature representations for ground truth generation: Application to handwritten character recognition

Abstract: A large number of labeled samples is required as training data to build an effective recognizer for an optical character recognition system. Although character samples can easily be collected from available manuscripts, they often lack class labels, especially for ancient and local alphabets. Creating a training dataset therefore requires experts to annotate a great number of characters manually, which is costly and time-consuming. To considerably reduce the human effort required to construct training datasets, this work proposes a novel semi-automatic labeling method that assumes no initially labeled samples. The proposed method performs an iterative procedure on a nearest neighbor graph that views the samples in multiple feature spaces. In each iteration, an expert is first asked to label one relevant unlabeled sample, which is selected automatically from the highest-density area of unlabeled samples. The manually assigned label is then propagated to neighboring samples under safe conditions based on sample density and agreement across the multiple views. The procedure is repeated until all unlabeled samples are labeled. The labeling procedure is evaluated on the MNIST, Devanagari, Thai, and Lanna Dhamma datasets. The results show that the proposed method outperforms state-of-the-art labeling methods, achieving the highest labeling accuracy. In addition, it can handle outlier samples and alphabets that include visually similar characters. Moreover, the recognition performance of a classifier trained on the semi-automatically generated training dataset is comparable to that of a classifier trained on the actual ground truth.
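The iterative procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the density estimate or the "safe conditions", so the sketch assumes a density equal to the inverse mean k-nearest-neighbor distance summed over views, and it propagates a label only to unlabeled samples that are k-nearest neighbors of the queried sample in every view. The names `knn_indices`, `semi_automatic_labeling`, `views`, and `oracle` are hypothetical.

```python
import numpy as np


def knn_indices(X, k):
    """Return indices and distances of the k nearest neighbors of each row (self excluded)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a sample is not its own neighbor
    idx = np.argsort(d, axis=1)[:, :k]
    return idx, np.take_along_axis(d, idx, axis=1)


def semi_automatic_labeling(views, oracle, k=5):
    """Label every sample using expert queries plus neighbor propagation.

    views  : list of (n, d_v) arrays, one per feature representation
    oracle : callable mapping a sample index to its label (stands in for the expert)
    Returns the label array and the number of expert queries used.
    """
    n = views[0].shape[0]
    neighbors, density = [], np.zeros(n)
    for X in views:
        idx, dist = knn_indices(X, k)
        neighbors.append([set(row) for row in idx])
        # Assumed density estimate: inverse mean distance to the k nearest neighbors.
        density += 1.0 / (dist.mean(axis=1) + 1e-12)

    labels = np.full(n, -1)  # -1 marks an unlabeled sample
    n_queries = 0
    while (labels == -1).any():
        unlabeled = np.flatnonzero(labels == -1)
        # Query the expert on the unlabeled sample with the highest density.
        seed = unlabeled[np.argmax(density[unlabeled])]
        labels[seed] = oracle(seed)
        n_queries += 1
        # Assumed safe condition: propagate only to samples that are
        # neighbors of the seed in every view and are still unlabeled.
        shared = set.intersection(*(nb[seed] for nb in neighbors))
        for j in shared:
            if labels[j] == -1:
                labels[j] = labels[seed]
    return labels, n_queries


# Toy usage: two well-separated clusters seen through two feature views.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (30, 2)), rng.normal(100.0, 1.0, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
labels, n_queries = semi_automatic_labeling([X, 2.0 * X[:, ::-1]], lambda i: y[i])
print(f"queries: {n_queries} of {len(y)}, accuracy: {(labels == y).mean():.2f}")
```

Because a label reaches only samples that all views agree are close to the queried one, an isolated sample is never labeled by propagation in this sketch and ends up being queried directly, which is one simple way to treat outliers. The stricter the agreement condition, the more expert queries are needed, so the condition trades labeling accuracy against human effort.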
