A Simple and Robust Convolutional-Attention Network for Irregular Text Recognition

Reading irregular scene text of arbitrary shape in natural images is still a challenging problem, despite the progress made recently. Many existing approaches incorporate sophisticated network structures to handle various shapes, use extra annotations for stronger supervision, or employ hard-to-train recurrent neural networks for sequence modeling. In this work, we propose a simple yet robust approach for scene text recognition. With no need to convert input images to sequence representations, we directly connect two-dimensional CNN features to an attention-based sequence decoder. As no recurrent module is adopted, our model can be trained in parallel. It achieves 1.7x to 10x acceleration to backward pass and 1.4x to 9x acceleration to forward pass, compared with the RNN counterparts. The proposed model is trained with only word-level annotations. With this simple design, our method achieves state-of-the-art or competitive recognition performance on the evaluated regular and irregular scene text benchmark datasets.

[1]  Abhinav Gupta,et al.  Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[2]  Shuigeng Zhou,et al.  Edit Probability for Scene Text Recognition , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[3]  Jian Zhang,et al.  Scene Text Recognition from Two-Dimensional Perspective , 2018, AAAI.

[4]  Shuigeng Zhou,et al.  AON: Towards Arbitrarily-Oriented Text Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[5]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[6]  Palaiahnakote Shivakumara,et al.  A robust arbitrary text detection system for natural scene images , 2014, Expert Syst. Appl..

[7]  Dimosthenis Karatzas,et al.  Single Shot Scene Text Retrieval , 2018, ECCV.

[8]  Andrew Zisserman,et al.  Spatial Transformer Networks , 2015, NIPS.

[9]  Theo Gevers,et al.  Con-Text: Text Detection for Fine-Grained Object Classification , 2017, IEEE Transactions on Image Processing.

[10]  Matthew D. Zeiler ADADELTA: An Adaptive Learning Rate Method , 2012, ArXiv.

[11]  Jon Almazán,et al.  ICDAR 2013 Robust Reading Competition , 2013, 2013 12th International Conference on Document Analysis and Recognition.

[12]  Yann Dauphin,et al.  Convolutional Sequence to Sequence Learning , 2017, ICML.

[13]  Hanqing Lu,et al.  Reading Scene Text with Attention Convolutional Sequence Modeling , 2017, ArXiv.

[14]  Wei Liu,et al.  STAR-Net: A SpaTial Attention Residue Network for Scene Text Recognition , 2016, BMVC.

[15]  Xiang Bai,et al.  An End-to-End Trainable Neural Network for Image-Based Sequence Recognition and Its Application to Scene Text Recognition , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[16]  Kai Wang,et al.  End-to-end scene text recognition , 2011, 2011 International Conference on Computer Vision.

[17]  Ankush Gupta,et al.  Synthetic Data for Text Localisation in Natural Images , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Chunhua Shen,et al.  Towards End-to-End Text Spotting with Convolutional Recurrent Neural Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[19]  Shuigeng Zhou,et al.  Focusing Attention: Towards Accurate Text Recognition in Natural Images , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[20]  Li Liu,et al.  SCAN: Sliding Convolutional Attention Network for Scene Text Recognition , 2018, ArXiv.

[21]  Khalil Sima'an,et al.  Multi30K: Multilingual English-German Image Descriptions , 2016, VL@ACL.

[22]  Jiebo Luo,et al.  Integrating Scene Text and Visual Appearance for Fine-Grained Image Classification , 2017, IEEE Access.

[23]  Bo Xu,et al.  NRTR: A No-Recurrence Sequence-to-Sequence Model for Scene Text Recognition , 2018, 2019 International Conference on Document Analysis and Recognition (ICDAR).

[24]  Xiaolin Hu,et al.  Gated Recurrent Convolution Neural Network for OCR , 2017, NIPS.

[25]  Palaiahnakote Shivakumara,et al.  Recognizing Text with Perspective Distortion in Natural Scenes , 2013, 2013 IEEE International Conference on Computer Vision.

[26]  Peng Wang,et al.  Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition , 2018, AAAI.

[27]  Wei Liu,et al.  Char-Net: A Character-Aware Neural Network for Distorted Scene Text Recognition , 2018, AAAI.

[28]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Hao Yu,et al.  SqueezedText: A Real-Time Scene Text Recognition by Binary Convolutional Encoder-Decoder Network , 2018, AAAI.

[30]  Quoc V. Le,et al.  QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension , 2018, ICLR.

[31]  Xiang Bai,et al.  ASTER: An Attentional Scene Text Recognizer with Flexible Rectification , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[32]  Jiebo Luo,et al.  VizWiz Grand Challenge: Answering Visual Questions from Blind People , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[33]  Arnold W. M. Smeulders,et al.  Words Matter: Scene Text for Image Classification and Retrieval , 2017, IEEE Transactions on Multimedia.

[34]  Lukasz Kaiser,et al.  Universal Transformers , 2018, ICLR.

[35]  Shijian Lu,et al.  ESIR: End-To-End Scene Text Recognition via Iterative Image Rectification , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Lianwen Jin,et al.  A Multi-Object Rectified Attention Network for Scene Text Recognition , 2019, Pattern Recognit..

[37]  Zihan Zhou,et al.  Learning to Read Irregular Text with Attention Mechanisms , 2017, IJCAI.

[38]  Xiang Bai,et al.  Robust Scene Text Recognition with Automatic Rectification , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Shuang Xu,et al.  Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[40]  Andrew Zisserman,et al.  Reading Text in the Wild with Convolutional Neural Networks , 2014, International Journal of Computer Vision.

[41]  Simon Osindero,et al.  Recursive Recurrent Nets with Attention Modeling for OCR in the Wild , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  C. V. Jawahar,et al.  Scene Text Recognition using Higher Order Language Priors , 2009, BMVC.