Decoupling Recognition from Detection: Single Shot Self-Reliant Scene Text Spotter

Typical text spotters follow the two-stage spotting strategy: detect the precise boundary for a text instance first and then perform text recognition within the located text region. While such strategy has achieved substantial progress, there are two underlying limitations. 1) The performance of text recognition depends heavily on the precision of text detection, resulting in the potential error propagation from detection to recognition. 2) The RoI cropping which bridges the detection and recognition brings noise from background and leads to information loss when pooling or interpolating from feature maps. In this work we propose the single shot Self-Reliant Scene Text Spotter (SRSTS), which circumvents these limitations by decoupling recognition from detection. Specifically, we conduct text detection and recognition in parallel and bridge them by the shared positive anchor point. Consequently, our method is able to recognize the text instances correctly even though the precise text boundaries are challenging to detect. Additionally, our method reduces the annotation cost for text detection substantially. Extensive experiments on regular-shaped benchmark and arbitrary-shaped benchmark demonstrate that our SRSTS compares favorably to previous state-of-the-art spotters in terms of both accuracy and efficiency.

[1]  Chunhua Shen,et al.  ABCNet v2: Adaptive Bezier-Curve Network for Real-Time End-to-End Text Spotting , 2021, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Chunhua Shen,et al.  PAN++: Towards Efficient and Accurate End-to-End Spotting of Arbitrarily-Shaped Text , 2021, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3]  Guangming Shi,et al.  PGNet: Real-time Arbitrarily-Shaped Text Spotting with Point Gathering Network , 2021, AAAI.

[4]  Shiliang Pu,et al.  MANGO: A Mask Attention Guided One-Stage Scene Text Spotter , 2020, AAAI.

[5]  Zheng-Jun Zha,et al.  R-Net: A Relationship Network for Efficient and Accurate Scene Text Detection , 2020, IEEE Transactions on Multimedia.

[6]  Xiang Bai,et al.  Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[7]  Jeonghun Baek,et al.  Character Region Attention For Text Spotting , 2020, ECCV.

[8]  Jing Huang,et al.  Mask TextSpotter v3: Segmentation Proposal Network for Robust Scene Text Spotting , 2020, ECCV.

[9]  Hao Chen,et al.  ABCNet: Real-Time Scene Text Spotting With Adaptive Bezier-Curve Network , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Fei Wu,et al.  Text Perceptron: Towards End-to-End Arbitrary-Shaped Text Spotting , 2020, AAAI.

[11]  Hao Wang,et al.  All You Need Is Boundary: Toward Arbitrary-Shaped Text Spotting , 2019, AAAI.

[12]  Quoc V. Le,et al.  EfficientDet: Scalable and Efficient Object Detection , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Linjie Xing,et al.  Convolutional Character Networks , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[14]  Wei Feng,et al.  TextDragon: An End-to-End Framework for Arbitrary Shaped Text Spotting , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[15]  Lianwen Jin,et al.  ICDAR2019 Robust Reading Challenge on Arbitrary-Shaped Text - RRC-ArT , 2019, 2019 International Conference on Document Analysis and Recognition (ICDAR).

[16]  Alessandro Bissacco,et al.  Towards Unconstrained End-to-End Text Spotting , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[17]  Yong Jae Lee,et al.  YOLACT: Real-Time Instance Segmentation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[18]  Errui Ding,et al.  TextNet: Irregular Text Reading from Images with an End-to-End Trainable Network , 2018, ACCV.

[19]  Xiang Bai,et al.  TextBoxes++: A Single-Shot Oriented Scene Text Detector , 2018, IEEE Transactions on Image Processing.

[20]  Junjie Yan,et al.  FOTS: Fast Oriented Text Spotting with a Unified Network , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[21]  Iasonas Kokkinos,et al.  DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[22]  Wafa Khlif,et al.  ICDAR2017 Robust Reading Challenge on Multi-Lingual Scene Text Detection and Script Identification - RRC-MLT , 2017, 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR).

[23]  Chee Seng Chan,et al.  Total-Text: A Comprehensive Dataset for Scene Text Detection and Recognition , 2017, 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR).

[24]  Jiri Matas,et al.  Deep TextSpotter: An End-to-End Trainable Scene Text Localization and Recognition Framework , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[25]  Chunhua Shen,et al.  Towards End-to-End Text Spotting with Convolutional Recurrent Neural Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[26]  Shuchang Zhou,et al.  EAST: An Efficient and Accurate Scene Text Detector , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Wenyu Liu,et al.  TextBoxes: A Fast Text Detector with a Single Deep Neural Network , 2016, AAAI.

[28]  Xiang Bai,et al.  An End-to-End Trainable Neural Network for Image-Based Sequence Recognition and Its Application to Scene Text Recognition , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[29]  Yuning Jiang,et al.  UnitBox: An Advanced Object Detection Network , 2016, ACM Multimedia.

[30]  Seyed-Ahmad Ahmadi,et al.  V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation , 2016, 2016 Fourth International Conference on 3D Vision (3DV).

[31]  A. Vedaldi,et al.  Synthetic Data for Text Localisation in Natural Images , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Jiri Matas,et al.  COCO-Text: Dataset and Benchmark for Text Detection and Recognition in Natural Images , 2016, ArXiv.

[33]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Wei Liu,et al.  SSD: Single Shot MultiBox Detector , 2015, ECCV.

[35]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[36]  Andrew Zisserman,et al.  Reading Text in the Wild with Convolutional Neural Networks , 2014, International Journal of Computer Vision.

[37]  Jürgen Schmidhuber,et al.  Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[38]  Fred L. Bookstein,et al.  Principal Warps: Thin-Plate Splines and the Decomposition of Deformations , 1989, IEEE Trans. Pattern Anal. Mach. Intell..