EPAN: Effective parts attention network for scene text recognition

Abstract

Most previous attention-based scene text recognition methods transform an image into high-level feature vectors that form a feature map of height one. Such vectors may contain unnecessary noise that limits recognition performance. To address this issue, we propose the effective parts attention network (EPAN), which attentively highlights the character region for more precise recognition. EPAN is end-to-end trainable and consists of a text image encoder and a character effective parts decoder (CEPD). The encoder separates the high-dimensional feature map into one-dimensional vectors row by row and feeds them to a bidirectional long short-term memory unit to encode contextual information. At each time step, the CEPD first transforms these vectors with a novel glimpse network to roughly determine the position of the current character, and then uses a refinement network to generate a mask that gradually localizes the important parts of that character. Experiments on IIIT5K-Words, Street View Text, ICDAR 2003, ICDAR 2013, CUTE80, Street View Text Perspective, and ICDAR 2015 show that EPAN significantly outperforms or is comparable to existing methods in terms of lexicon-free word accuracy. Substantial qualitative results further demonstrate the robustness of the method.
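
The coarse-then-refined attention described in the abstract can be illustrated with a small, self-contained sketch. This is not the EPAN implementation: the function names (`coarse_glimpse`, `refine`, `attend`), the dot-product scoring, and the thresholded mask with its `keep_ratio` parameter are all assumptions chosen for illustration, standing in for the learned glimpse and refinement networks. The sketch only shows the general pattern of first spreading attention over all encoded vectors and then masking out low-weight positions before pooling.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def coarse_glimpse(features, query):
    """Coarse step (assumed form): dot-product attention over all vectors."""
    scores = [sum(f_d * q_d for f_d, q_d in zip(f, query)) for f in features]
    return softmax(scores)


def refine(weights, keep_ratio=0.5):
    """Refinement step (assumed form): zero out positions whose weight falls
    below keep_ratio * max(weights), then renormalize the rest."""
    threshold = keep_ratio * max(weights)
    masked = [w if w >= threshold else 0.0 for w in weights]
    total = sum(masked)
    return [w / total for w in masked]


def attend(features, weights):
    """Weighted sum of the feature vectors under the given attention weights."""
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


# Hypothetical encoder outputs: four 2-D vectors, two of them "on" a character.
features = [[0.1, 0.0], [2.0, 1.0], [1.8, 1.2], [0.0, 0.2]]
query = [1.0, 1.0]  # hypothetical decoder state at the current time step

coarse = coarse_glimpse(features, query)  # rough localization
fine = refine(coarse)                     # masked, sharper localization
context = attend(features, fine)          # vector passed on for classification
```

In this toy input the coarse weights still assign a small amount of mass to the two background vectors, while the refined weights remove them entirely, so the pooled `context` depends only on the two vectors covering the character.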
