Transformer-based Text Detection in the Wild

A major limitation of most state-of-the-art visual localization methods is their inability to make use of the ubiquitous signs and directions that humans find intuitive. Localization methods can benefit greatly from a system capable of reasoning about a variety of cues beyond low-level features, such as street signs, store names, building directories, and room numbers. In this work, we tackle the problem of text detection in the wild, an essential step towards achieving text-based localization and mapping. While current state-of-the-art text detection methods employ ad-hoc solutions with complex multi-stage components, we propose a Transformer-based architecture inherently capable of dealing with multi-oriented text in images. A central contribution of our work is a loss function tailored to the rotated text detection problem, which leverages a rotated version of the generalized intersection over union score to properly capture rotated text regions. We evaluate our proposed model qualitatively and quantitatively on several challenging datasets, namely ICDAR15, ICDAR17, and MSRA-TD500, and show that it outperforms current state-of-the-art methods in text detection in the wild.
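To make the idea of a rotated generalized intersection over union score concrete, the following is a minimal sketch in plain Python and not the implementation used in this work. It assumes boxes are parameterized as (cx, cy, w, h, theta) with theta in radians, uses Sutherland-Hodgman clipping for the intersection polygon, and takes the convex hull of both boxes' corners as the smallest enclosing region; the parameterization, the choice of enclosing region, and all function names are illustrative assumptions.

```python
import math

def box_corners(cx, cy, w, h, theta):
    """Corners of a rotated box in counter-clockwise order (theta in radians)."""
    c, s = math.cos(theta), math.sin(theta)
    offsets = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(cx + dx * c - dy * s, cy + dx * s + dy * c) for dx, dy in offsets]

def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive if b lies left of the line o->a."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def polygon_area(pts):
    """Shoelace formula; returns 0 for degenerate polygons."""
    if len(pts) < 3:
        return 0.0
    acc = 0.0
    for i in range(len(pts)):
        x1, y1 = pts[i]
        x2, y2 = pts[(i + 1) % len(pts)]
        acc += x1 * y2 - x2 * y1
    return abs(acc) / 2.0

def clip_convex(subject, clip):
    """Sutherland-Hodgman: clip `subject` against a convex, CCW `clip` polygon."""
    output = list(subject)
    for i in range(len(clip)):
        a, b = clip[i], clip[(i + 1) % len(clip)]
        inp, output = output, []
        if not inp:
            break
        for j in range(len(inp)):
            p, q = inp[j - 1], inp[j]
            dp, dq = cross(a, b, p), cross(a, b, q)
            if (dp >= 0) != (dq >= 0):
                # The edge p->q crosses the clip line; add the crossing point.
                t = dp / (dp - dq)
                output.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
            if dq >= 0:
                output.append(q)
    return output

def convex_hull(points):
    """Andrew's monotone chain convex hull."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def rotated_giou(box_a, box_b):
    """GIoU = IoU - (enclosing area - union area) / enclosing area."""
    pa, pb = box_corners(*box_a), box_corners(*box_b)
    inter = polygon_area(clip_convex(pa, pb))
    union = polygon_area(pa) + polygon_area(pb) - inter
    enclosing = polygon_area(convex_hull(pa + pb))
    return inter / union - (enclosing - union) / enclosing

def rotated_giou_loss(box_a, box_b):
    """Loss is 0 for a perfect match and grows as the boxes move apart."""
    return 1.0 - rotated_giou(box_a, box_b)
```

Unlike plain IoU, this score stays informative when the predicted and ground-truth boxes do not overlap, because the enclosing-region term still penalizes distance between them. The sketch is not differentiable as written; using it as a training loss would require the same geometry expressed in a tensor framework with automatic differentiation.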
