Towards Accurate Video Text Spotting with Text-wise Semantic Reasoning

Video text spotting (VTS) aims to extract text from videos by performing text detection, tracking, and recognition simultaneously. Existing VTS methods, however, may overlook the underlying semantic relationships among the texts within a frame. We observe that texts within a frame usually share similar semantics, which suggests that a text predicted incorrectly by the recognizer can still be corrected through semantic reasoning. In this paper, we propose an accurate video text spotter, VLSpotter, that reads texts visually, linguistically, and semantically. For ‘visually’, we propose a plug-and-play text-focused super-resolution module to alleviate motion blur and enhance video quality. For ‘linguistically’, a language model is employed to capture intra-text context and reduce misspelled text predictions. For ‘semantically’, we propose a text-wise semantic reasoning module that models inter-text semantic relationships and reasons over them to refine the predictions. Experimental results on multiple VTS benchmarks demonstrate that the proposed VLSpotter outperforms existing state-of-the-art methods in end-to-end video text spotting.
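
The intuition that texts in the same frame can correct one another may be illustrated with a toy re-ranking sketch. This is not the VLSpotter module, which the abstract describes only at a high level; it is a hand-written stand-in under stated assumptions. The `RELATEDNESS` table, the `rerank_frame` function, the `weight` parameter, and all scores are hypothetical and chosen purely for illustration.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical relatedness scores between word pairs (illustrative values only;
# a real system would learn inter-text relationships rather than look them up).
RELATEDNESS: Dict[FrozenSet[str], float] = {
    frozenset({"coffee", "bakery"}): 0.90,
    frozenset({"coffee", "cafe"}): 0.95,
    frozenset({"cafe", "bakery"}): 0.80,
}


def relatedness(a: str, b: str) -> float:
    """Return the assumed semantic relatedness of two words (0.0 if unknown)."""
    return RELATEDNESS.get(frozenset({a, b}), 0.0)


def rerank_frame(
    candidates: List[List[Tuple[str, float]]], weight: float = 0.5
) -> List[str]:
    """Re-rank each text's candidates using the other texts in the frame.

    candidates[i] holds (word, recognizer_confidence) pairs for the i-th text.
    The final score is confidence + weight * mean relatedness to the other
    texts' top recognizer predictions.
    """
    # First pass: take the most confident candidate of every text as context.
    initial = [max(cands, key=lambda p: p[1])[0] for cands in candidates]

    results = []
    for i, cands in enumerate(candidates):
        context = [w for j, w in enumerate(initial) if j != i]

        def score(pair: Tuple[str, float]) -> float:
            word, conf = pair
            if not context:
                return conf  # a lone text has nothing to reason with
            sem = sum(relatedness(word, c) for c in context) / len(context)
            return conf + weight * sem

        results.append(max(cands, key=score)[0])
    return results


# One frame with three texts; the third is ambiguous for the recognizer.
frame = [
    [("coffee", 0.90)],
    [("bakery", 0.85)],
    [("cate", 0.50), ("cafe", 0.45)],
]
corrected = rerank_frame(frame)
print(corrected)
```

In this toy frame the recognizer slightly prefers the misspelling "cate", but the neighbouring texts "coffee" and "bakery" raise the score of "cafe", so the re-ranked output selects it. A text with no neighbours keeps its most confident candidate.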
