FREE: A Fast and Robust End-to-End Video Text Spotter

Currently, video text spotting methods usually follow a four-stage pipeline: detecting text regions in individual frames, recognizing the localized text regions frame by frame, tracking text streams, and post-processing to generate the final results. However, such methods may suffer from huge computational cost as well as sub-optimal results, owing to interference from low-quality text and the non-trainable pipeline strategy. In this article, we propose a fast and robust end-to-end video text spotting framework named FREE, which recognizes each localized text stream only once instead of frame by frame. Specifically, FREE first employs a well-designed spatial-temporal detector that learns text locations across video frames. Then a novel text recommender is developed to select the highest-quality text from each text stream for recognition. Here, the recommender is implemented by assembling text tracking, quality scoring and recognition into a single trainable module. It not only avoids interference from low-quality text but also dramatically speeds up video text spotting. FREE unites the detector and the recommender into one framework, which helps achieve global optimization. In addition, to support the video text spotting community, we collect a large-scale video text dataset containing 100 videos from 21 real-life scenarios. Extensive experiments on public benchmarks show that our method greatly speeds up the text spotting process and also achieves state-of-the-art performance.
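The idea of recognizing each text stream once can be illustrated with a minimal sketch. It is not the FREE implementation: it assumes tracking and quality scoring have already produced a stream identifier and a scalar score per cropped text region, and the names `TextCrop`, `spot_streams` and `recognize` are hypothetical. In FREE these steps are learned jointly inside a trainable module, whereas here they are plain Python.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TextCrop:
    """One detected text region in one frame (all fields are illustrative)."""
    track_id: int    # hypothetical stream identifier assigned by a tracker
    frame_idx: int   # index of the frame the crop came from
    quality: float   # hypothetical quality score, higher is better
    image: str       # placeholder standing in for the cropped pixels


def spot_streams(
    crops: List[TextCrop],
    recognize: Callable[[str], str],
) -> Dict[int, str]:
    """Recognize only the highest-quality crop of each text stream."""
    best: Dict[int, TextCrop] = {}
    for crop in crops:
        current = best.get(crop.track_id)
        # Keep the first crop seen, or replace it with a strictly better one.
        if current is None or crop.quality > current.quality:
            best[crop.track_id] = crop
    # One recognizer call per stream, rather than one per frame.
    return {track_id: recognize(crop.image) for track_id, crop in best.items()}
```

With this structure the number of recognizer calls equals the number of text streams rather than the number of detected regions, which is where the speed-up described above would come from under these assumptions.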
