Towards End-to-End Text Spotting in Natural Scenes

Text spotting in natural scene images is of great importance for many image understanding tasks. It comprises two sub-tasks: text detection and text recognition. In this work, we propose a unified network that simultaneously localizes and recognizes text in a single forward pass, avoiding intermediate steps such as image cropping, feature re-calculation, word separation, and character grouping. In contrast to existing approaches that treat text detection and recognition as two distinct tasks and tackle them one after the other, the proposed framework addresses both tasks concurrently. The whole framework can be trained end-to-end and is able to handle text of arbitrary shapes. The convolutional features are calculated only once and shared by both the detection and recognition modules. Through multi-task training, the learned features become more discriminative, which improves the overall performance. By employing a 2D attention model in word recognition, the irregularity of text can be robustly addressed. The attention model provides the spatial location of each character, which not only helps local feature extraction in word recognition, but also indicates an orientation angle that is used to refine text localization. The proposed method achieves state-of-the-art performance on several standard text spotting benchmarks, including both regular and irregular ones.
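The role of the 2D attention model can be illustrated with a minimal sketch. It is not the network described above. It assumes a plain dot-product attention over an H x W grid of feature vectors, and the function names `attention_2d` and `orientation_angle`, the scoring function, and the use of the attention centroid as the character location are all hypothetical choices made for illustration. The sketch shows how one attention map yields both a local feature (the glimpse) for recognition and a spatial location, and how the locations of the first and last characters could suggest an orientation angle.

```python
import math


def attention_2d(features, query):
    """Dot-product 2D attention over an H x W grid (illustrative assumption).

    features: list of H rows, each a list of W feature vectors (C floats).
    query: list of C floats, standing in for the decoder state at one step.
    Returns (weights, glimpse, center): the H x W softmax weights, the
    C-dimensional weighted feature sum, and the attention centroid (x, y).
    """
    h, w, c = len(features), len(features[0]), len(query)
    # Score every spatial position against the query.
    scores = [[sum(f * q for f, q in zip(features[y][x], query))
               for x in range(w)] for y in range(h)]
    # Softmax over all H * W positions, shifted by the max for stability.
    peak = max(max(row) for row in scores)
    exps = [[math.exp(s - peak) for s in row] for row in scores]
    total = sum(sum(row) for row in exps)
    weights = [[e / total for e in row] for row in exps]
    # Glimpse: attention-weighted sum of the feature vectors.
    glimpse = [sum(weights[y][x] * features[y][x][k]
                   for y in range(h) for x in range(w)) for k in range(c)]
    # Centroid of the attention map, used here as the character location.
    cx = sum(weights[y][x] * x for y in range(h) for x in range(w))
    cy = sum(weights[y][x] * y for y in range(h) for x in range(w))
    return weights, glimpse, (cx, cy)


def orientation_angle(first_center, last_center):
    """Angle in degrees of the line from the first to the last character.

    Coordinates are (x, y) in image convention, with y pointing down.
    """
    dx = last_center[0] - first_center[0]
    dy = last_center[1] - first_center[1]
    return math.degrees(math.atan2(dy, dx))
```

Calling `attention_2d` once per decoding step gives one centroid per character. Passing the first and last centroids to `orientation_angle` gives a rough estimate of the text direction under these assumptions.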
