Towards the Unseen: Iterative Text Recognition by Distilling from Errors

We introduce a meta-framework that can be incorporated on top of most state-of-the-art (SOTA) text recognition frameworks. Along with the seminal ASTER by Shi et al. [16], we consider two more recent text recognition frameworks, Show, Attend and Read (SAR) [12] and SCATTER [14], as baselines on which we validate our iterative framework. SAR extends the work of Shi et al. by using a 2D attention mechanism, which eliminates the need for a rectification network. SCATTER, in turn, couples multiple BLSTM encoders for richer context modelling, followed by a gating mechanism that balances context-rich information against backbone CNN features. Here, we additionally report results using lexicons for the different datasets under both DS and CS setups. Furthermore, we add graphs showing the margin by which our framework improves over each baseline SOTA framework under both setups. Since recognising rarely seen or unseen words is the major focus of this work, it is notable that we achieve a significant performance gain on both STR (5-7%) and HTR (10-12%) tasks under the DS setup. We also obtain a reasonable gain under the CS setup, consistently across all the studied baseline SOTA frameworks.
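To make the idea of a gate that balances contextual and visual features concrete, the following is a minimal sketch for a single time step. It is an illustration under stated assumptions, not SCATTER's exact formulation: the function name `gated_fusion`, the single linear layer, and the element-wise sigmoid gate over the concatenated features are all hypothetical choices made here.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(visual: List[float], context: List[float],
                 weights: List[List[float]], bias: List[float]) -> List[float]:
    """Fuse visual (CNN) and contextual (BLSTM) features for one time step.

    Assumed formulation (hypothetical, for illustration only):
        d    = [visual; context]        (concatenation)
        gate = sigmoid(W d + b)         (one value in (0, 1) per channel)
        out  = gate * d                 (element-wise re-weighting)
    """
    concat = visual + context  # list concatenation of the two feature vectors
    if len(weights) != len(concat) or len(bias) != len(concat):
        raise ValueError("weights and bias must have one row/entry per fused channel")
    gate = [
        sigmoid(sum(w * x for w, x in zip(row, concat)) + b)
        for row, b in zip(weights, bias)
    ]
    return [g * x for g, x in zip(gate, concat)]


if __name__ == "__main__":
    visual = [1.0, 2.0]     # stand-in for backbone CNN features
    context = [4.0, -2.0]   # stand-in for BLSTM context features
    dim = len(visual) + len(context)
    zero_weights = [[0.0] * dim for _ in range(dim)]
    zero_bias = [0.0] * dim
    # With zero weights and bias every gate value is 0.5, so each channel is halved.
    print(gated_fusion(visual, context, zero_weights, zero_bias))
```

In a trained model the weights and bias would be learned, so the gate could suppress or pass each visual or contextual channel depending on the input rather than scaling all channels equally.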

[1]  Kai Wang,et al.  End-to-end scene text recognition , 2011, 2011 International Conference on Computer Vision.

[2]  Gregory Shakhnarovich,et al.  Deep Back-Projection Networks for Super-Resolution , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[3]  Yuhong Guo,et al.  Zero-Shot Classification with Discriminative Semantic Representation Learning , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Serge J. Belongie,et al.  Arbitrary Style Transfer in Real-Time with Adaptive Instance Normalization , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[5]  Kaigui Bian,et al.  Symmetry-Constrained Rectification Network for Scene Text Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[6]  Andrew Zisserman,et al.  Deep Structured Output Learning for Unconstrained Text Recognition , 2014, ICLR.

[7]  Canjie Luo,et al.  Learn to Augment: Joint Data Augmentation and Network Optimization for Text Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Xiaoyong Shen,et al.  2D Attentional Irregular Scene Text Recognizer , 2019, ArXiv.

[9]  Maxine Eskénazi,et al.  Learning Discourse-level Diversity for Neural Dialog Models using Conditional Variational Autoencoders , 2017, ACL.

[10]  Jitendra Malik,et al.  Human Pose Estimation with Iterative Error Feedback , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Samy Bengio,et al.  Generating Sentences from a Continuous Space , 2015, CoNLL.

[12]  Xiang Bai,et al.  Scene Text Image Super-Resolution in the Wild , 2020, ECCV.

[13]  Seong Joon Oh,et al.  What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[14]  Raymond Hendy Susanto,et al.  The CoNLL-2014 Shared Task on Grammatical Error Correction , 2014 .

[15]  Weiping Wang,et al.  SEED: Semantics Enhanced Encoder-Decoder Framework for Scene Text Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Jia Deng,et al.  Stacked Hourglass Networks for Human Pose Estimation , 2016, ECCV.

[17]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[18]  Jianfei Cai,et al.  Pluralistic Image Completion , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Horst Bunke,et al.  The IAM-database: an English sentence database for offline handwriting recognition , 2002, International Journal on Document Analysis and Recognition.

[20]  Varun Ramakrishna,et al.  Convolutional Pose Machines , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Na-Rae Han,et al.  Detecting errors in English article usage by non-native speakers , 2006 .

[22]  Andrew Zisserman,et al.  Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition , 2014, ArXiv.

[23]  R. Manmatha,et al.  SCATTER: Selective Context Attentional Scene Text Recognizer , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Aaron C. Courville,et al.  FiLM: Visual Reasoning with a General Conditioning Layer , 2017, AAAI.

[25]  Rui Yao,et al.  CANet: Class-Agnostic Segmentation Networks With Iterative Refinement and Attentive Few-Shot Learning , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Xiang Bai,et al.  An End-to-End Trainable Neural Network for Image-Based Sequence Recognition and Its Application to Scene Text Recognition , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[27]  Lin Sun,et al.  Feedback Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Ankush Gupta,et al.  Synthetic Data for Text Localisation in Natural Images , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Jon Almazán,et al.  ICDAR 2013 Robust Reading Competition , 2013, 2013 12th International Conference on Document Analysis and Recognition.

[30]  Andrew Zisserman,et al.  Spatial Transformer Networks , 2015, NIPS.

[31]  Shijian Lu,et al.  Verisimilar Image Synthesis for Accurate Detection and Recognition of Texts in Scenes , 2018, ECCV.

[32]  Jürgen Schmidhuber,et al.  Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[33]  Palaiahnakote Shivakumara,et al.  Recognizing Text with Perspective Distortion in Natural Scenes , 2013, 2013 IEEE International Conference on Computer Vision.

[34]  Shuigeng Zhou,et al.  AON: Towards Arbitrarily-Oriented Text Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[35]  Pau Riba,et al.  Candidate Fusion: Integrating Language Modelling into a Sequence-to-Sequence Handwritten Word Recognition Architecture , 2019, Pattern Recognit..

[36]  Rynson W. H. Lau,et al.  Distraction-Aware Shadow Detection , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Shijian Lu,et al.  ESIR: End-To-End Scene Text Recognition via Iterative Image Rectification , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Ernest Valveny,et al.  ICDAR 2015 competition on Robust Reading , 2015, 2015 13th International Conference on Document Analysis and Recognition (ICDAR).

[39]  Xiang Bai,et al.  Robust Scene Text Recognition with Automatic Rectification , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Zhanghui Kuang,et al.  RobustScanner: Dynamically Enhancing Positional Clues for Robust Text Recognition , 2020, ECCV.

[41]  Shuigeng Zhou,et al.  Focusing Attention: Towards Accurate Text Recognition in Natural Images , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[42]  Jiří Matas,et al.  Real-time scene text localization and recognition , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[43]  Simon Osindero,et al.  Recursive Recurrent Nets with Attention Modeling for OCR in the Wild , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Partha Pratim Roy,et al.  Handwriting Recognition in Low-Resource Scripts Using Adversarial Learning , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Jiebo Luo,et al.  On Vocabulary Reliance in Scene Text Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Ernest Valveny,et al.  Scene Text Visual Question Answering , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[47]  Lihi Zelnik-Manor,et al.  Adversarial Feedback Loop , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[48]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .

[49]  Palaiahnakote Shivakumara,et al.  A robust arbitrary text detection system for natural scene images , 2014, Expert Syst. Appl..

[50]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[51]  C. V. Jawahar,et al.  Scene Text Recognition using Higher Order Language Priors , 2009, BMVC.

[52]  Heng Tao Shen,et al.  Sequence-To-Sequence Domain Adaptation Network for Robust Text Image Recognition , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Honglak Lee,et al.  Learning Structured Output Representation using Deep Conditional Generative Models , 2015, NIPS.

[54]  Heng Tao Shen,et al.  What Machines See Is Not What They Get: Fooling Scene Text Recognition Models With Adversarial Text Images , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Daniel Jurafsky,et al.  Neural Language Correction with Character-Based Attention , 2016, ArXiv.

[56]  Andrew Zisserman,et al.  Reading Text in the Wild with Convolutional Neural Networks , 2014, International Journal of Computer Vision.

[57]  Peng Wang,et al.  Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition , 2018, AAAI.

[58]  Taesung Park,et al.  Semantic Image Synthesis With Spatially-Adaptive Normalization , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[59]  Lior Wolf,et al.  CNN-N-Gram for Handwriting Word Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[60]  Ning Zhang,et al.  Feedback Adversarial Learning: Spatial Feedback for Improving Generative Adversarial Networks , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[61]  Xiang Bai,et al.  ASTER: An Attentional Scene Text Recognizer with Flexible Rectification , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[62]  Xinlei Chen,et al.  Iterative Visual Reasoning Beyond Convolutions , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[63]  Tara N. Sainath,et al.  A Spelling Correction Model for End-to-end Speech Recognition , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[64]  Jitendra Malik,et al.  Iterative Instance Segmentation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[65]  Dan Roth,et al.  Generating Confusion Sets for Context-Sensitive Error Correction , 2010, EMNLP.

[66]  Yoshua Bengio,et al.  On Using Monolingual Corpora in Neural Machine Translation , 2015, ArXiv.