Sequential alignment attention model for scene text recognition

Abstract: Scene text recognition is an active research topic in computer vision because of its many applications. State-of-the-art solutions usually rely on an attention-based encoder-decoder framework that learns the mapping between input images and output sequences in a purely data-driven way. In real-world scenarios, however, the attended feature areas are often severely misaligned with the text labels. To address this problem, this paper proposes a sequential alignment attention model that strengthens the alignment between input images and output character sequences. In this model, an attention gated recurrent unit (AGRU) is first devised to distinguish text regions from background regions and to extract localized features that focus on sequential text regions. In addition, a connectionist temporal classification (CTC) guided decoding strategy is integrated into the popular attention-based decoder, which both speeds up training convergence and improves the recognition of well-aligned sequences. Extensive experiments on several benchmarks, including the IIIT5k, SVT, and ICDAR datasets, show that the method substantially outperforms state-of-the-art methods.
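
The abstract does not specify how CTC guidance is combined with the attention decoder, so the following is only a minimal sketch of two generic building blocks. The first is the standard CTC collapse rule, which maps a per-frame label path to a character sequence and is what enforces a left-to-right alignment. The second is a weighted sum of a CTC loss and an attention loss, as used in joint CTC-attention multi-task training for speech recognition. The function names, the blank symbol "-", and the default weight of 0.5 are assumptions made for illustration and are not taken from the paper.

```python
from itertools import groupby
from typing import Sequence

BLANK = "-"  # hypothetical blank symbol; real systems use a reserved class index


def ctc_collapse(frame_labels: Sequence[str], blank: str = BLANK) -> str:
    """Collapse a per-frame label path into a character sequence.

    Standard CTC rule: merge consecutive repeated labels first,
    then remove every blank.
    """
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)


def joint_loss(ctc_loss: float, attention_loss: float, weight: float = 0.5) -> float:
    """Interpolate a CTC loss and an attention loss.

    `weight` is an assumed hyperparameter in [0, 1]; the paper's actual
    value and formulation are not given in the abstract.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * ctc_loss + (1.0 - weight) * attention_loss


if __name__ == "__main__":
    # The blank between the two "l" runs keeps the double letter.
    print(ctc_collapse(list("hh-e-ll-lo--")))
    print(joint_loss(ctc_loss=2.0, attention_loss=4.0, weight=0.25))
```

Because the collapse rule only ever reads frames in order, the CTC branch cannot skip backwards over the image, which is the property that makes it a plausible guide for an attention decoder that may otherwise drift.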
