PMMN: Pre-trained Multi-Modal Network for Scene Text Recognition

Abstract: Scene text recognition (STR) requires large amounts of data to train a strong recognizer, including visual data such as images and linguistic data such as text. However, existing methods mainly train the entire framework end-to-end in a single stage, which relies heavily on well-annotated images and does not make effective use of both modalities. To address this, we propose a pre-trained multi-modal network (PMMN) that uses visual and linguistic data to pre-train a vision model and a language model separately, so that each learns modality-specific knowledge for accurate scene text recognition. Specifically, we first pre-train the proposed off-the-shelf vision model and language model to convergence. We then combine the pre-trained models in a unified framework for end-to-end fine-tuning, in which the learned visual and linguistic information interact to produce robust features for character prediction. Extensive experiments demonstrate the effectiveness of PMMN. Evaluation on six benchmarks shows that the proposed method outperforms most existing methods and achieves state-of-the-art performance.
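The two-stage idea can be illustrated with a toy sketch. This is not the PMMN architecture: a character bigram model stands in for the pre-trained language model, hand-written per-position distributions stand in for the output of a pre-trained vision model, and the log-linear fusion rule with its `weight` parameter is an assumption made only for illustration. End-to-end fine-tuning is not modelled. All names below are hypothetical.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def pretrain_language_model(corpus):
    """Stage 1, text only: add-one smoothed character bigram model (toy stand-in)."""
    counts = defaultdict(Counter)
    for word in corpus:
        # "^" marks the start of a word, so the first character has a context too.
        for prev, cur in zip("^" + word, word):
            counts[prev][cur] += 1

    def prob(prev, cur):
        total = sum(counts[prev].values())
        return (counts[prev][cur] + 1) / (total + len(ALPHABET))

    return prob


def fuse_and_decode(vision_probs, lm_prob, weight=1.0):
    """Stage 2: greedy decoding over a log-linear mix of both modalities (assumed rule)."""
    prev, out = "^", []
    for dist in vision_probs:
        def score(ch):
            visual = math.log(dist.get(ch, 1e-6))  # tiny floor for unseen characters
            linguistic = math.log(lm_prob(prev, ch))
            return visual + weight * linguistic

        prev = max(ALPHABET, key=score)
        out.append(prev)
    return "".join(out)


# Text-only "pre-training" data for the language side.
lm = pretrain_language_model(["cat", "car", "can", "cap", "hat"])

# Hypothetical vision output for an image of "cat" whose middle glyph is ambiguous.
vision_probs = [
    {"c": 0.9, "e": 0.1},
    {"o": 0.55, "a": 0.45},
    {"t": 0.9, "f": 0.1},
]

vision_only = fuse_and_decode(vision_probs, lm, weight=0.0)
fused = fuse_and_decode(vision_probs, lm, weight=1.0)
print(vision_only, fused)
```

With `weight=0.0` the decoder trusts the visual scores alone and picks the slightly more likely "o" for the ambiguous glyph; with `weight=1.0` the bigram statistics learned from text tip the decision toward "a". This mirrors, in a very reduced form, the abstract's claim that linguistic knowledge learned separately can make character prediction more robust.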
