Paper Information - FREE: A Fast and Robust End-to-End Video Text Spotter

FREE: A Fast and Robust End-to-End Video Text Spotter

Current video text spotting methods usually follow a four-stage pipeline: detecting text regions in individual images, recognizing the localized text regions frame by frame, tracking text streams, and post-processing to generate the final results. However, they may suffer from high computational cost as well as sub-optimal results, owing to interference from low-quality text and the non-trainable pipeline strategy. In this article, we propose a fast and robust end-to-end video text spotting framework named FREE, which recognizes each localized text stream only once instead of frame by frame. Specifically, FREE first employs a well-designed spatial-temporal detector that learns text locations across video frames. Then a novel text recommender is developed to select the highest-quality text from each text stream for recognition. The recommender is implemented by assembling text tracking, quality scoring and recognition into a single trainable module. It not only avoids interference from low-quality text but also dramatically speeds up video text spotting. FREE unites the detector and recommender into a single framework, which helps achieve global optimization. In addition, we collect a large-scale video text dataset containing 100 videos from 21 real-life scenarios to promote research in the video text spotting community. Extensive experiments on public benchmarks show that our method greatly speeds up the text spotting process and achieves state-of-the-art performance.
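The core idea of recognizing each text stream once can be illustrated with a minimal sketch. This is not the FREE implementation: the abstract describes a trainable recommender, whereas the sketch below assumes that track IDs and quality scores are already available and simply picks the highest-scoring crop per stream. The names `TextCrop`, `group_by_track`, `spot_once_per_stream` and `fake_recognize` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TextCrop:
    """One localized text region in one frame (hypothetical structure)."""
    track_id: int      # text stream this crop belongs to (assumed given by tracking)
    frame_index: int   # frame the crop was taken from
    quality: float     # assumed quality score, higher is better
    image: str         # stand-in for pixel data in this sketch


def group_by_track(crops: List[TextCrop]) -> Dict[int, List[TextCrop]]:
    """Collect the crops of each text stream under its track ID."""
    streams: Dict[int, List[TextCrop]] = {}
    for crop in crops:
        streams.setdefault(crop.track_id, []).append(crop)
    return streams


def spot_once_per_stream(
    crops: List[TextCrop],
    recognize: Callable[[TextCrop], str],
) -> Dict[int, str]:
    """Run the recognizer once per stream, on its highest-quality crop."""
    results: Dict[int, str] = {}
    for track_id, stream in group_by_track(crops).items():
        best = max(stream, key=lambda c: c.quality)
        results[track_id] = recognize(best)
    return results


# Usage with a stand-in recognizer that records which frames it was called on.
calls: List[int] = []


def fake_recognize(crop: TextCrop) -> str:
    calls.append(crop.frame_index)
    return crop.image.upper()


crops = [
    TextCrop(track_id=0, frame_index=0, quality=0.2, image="exit"),
    TextCrop(track_id=0, frame_index=1, quality=0.9, image="exit"),
    TextCrop(track_id=1, frame_index=1, quality=0.5, image="stop"),
    TextCrop(track_id=1, frame_index=2, quality=0.4, image="stop"),
]
results = spot_once_per_stream(crops, fake_recognize)
```

In this toy input, four crops belong to two text streams, so the recognizer runs twice rather than four times. A frame-wise pipeline would instead call it on every crop, including the low-quality ones.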
