Video Text Detection by Attentive Spatiotemporal Fusion of Deep Convolutional Features

Scene text in videos carries rich semantic information and plays an important role in various content-based video applications. Compared to text in static images, scene text in videos exhibits some distinct characteristics such as motion blur and temporal redundancy, which bring additional difficulties as well as exploitable clues to the text detection task. In this paper, we propose a novel end-to-end deep neural network for detecting scene text in the video, which combines complementary text features from multiple related frames to enhance the overall detection performance relative to single-frame detection schemes. Specifically, we first extract descriptive features from each video frame using a hierarchical convolutional neural network. Next, we spatiotemporally sample and warp supplementary features from adjacent frames surrounding the current frame using a multi-scale deformable convolution structure. We then aggregate the sampled features with an attention mechanism to adaptively focus on and augment relevant features and generate an enhanced feature representation of the current frame, which is further fed to the prediction network for localizing text candidates. The proposed model achieves state-of-the-art text detection performance on public scene text video datasets, demonstrating the superiority of the proposed multi-frame feature fusion based video text detection scheme to most single-frame and tracking-based detection schemes.

[1]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Kai Wang,et al.  End-to-end scene text recognition , 2011, 2011 International Conference on Computer Vision.

[3]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4]  Jiri Matas,et al.  Scene Text Localization and Recognition with Oriented Stroke Detection , 2013, 2013 IEEE International Conference on Computer Vision.

[5]  Xiaolin Li,et al.  Single Shot Text Detector with Regional Attention , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[6]  Gui-Song Xia,et al.  Rotation-Sensitive Regression for Oriented Scene Text Detection , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[7]  Palaiahnakote Shivakumara,et al.  A new Histogram Oriented Moments descriptor for multi-oriented moving text detection in video , 2015, Expert Syst. Appl..

[8]  Yuxiao Hu,et al.  Text From Corners: A Novel Approach to Detect Text and Caption in Videos , 2011, IEEE Transactions on Image Processing.

[9]  Kaizhu Huang,et al.  Robust Text Detection in Natural Scene Images , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  Shuchang Zhou,et al.  EAST: An Efficient and Accurate Scene Text Detector , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Palaiahnakote Shivakumara,et al.  Arbitrarily-oriented multi-lingual text detection in video , 2017, Multimedia Tools and Applications.

[12]  Jiri Matas,et al.  Robust wide-baseline stereo from maximally stable extremal regions , 2004, Image Vis. Comput..

[13]  Wenyu Liu,et al.  TextBoxes: A Fast Text Detector with a Single Deep Neural Network , 2016, AAAI.

[14]  Pan He,et al.  Detecting Text in Natural Image with Connectionist Text Proposal Network , 2016, ECCV.

[15]  Palaiahnakote Shivakumara,et al.  New Fourier-Statistical Features in RGB Space for Video Text Detection , 2010, IEEE Transactions on Circuits and Systems for Video Technology.

[16]  Weilin Huang,et al.  Robust Scene Text Detection with Convolution Neural Network Induced MSER Trees , 2014, ECCV.

[17]  Shijian Lu,et al.  Accurate Scene Text Detection through Border Semantics Awareness and Bootstrapping , 2018, ECCV.

[18]  Andrew Zisserman,et al.  Detect to Track and Track to Detect , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[19]  Yujie Wang,et al.  Flow-Guided Feature Aggregation for Video Object Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[20]  Shijian Lu,et al.  Multioriented Video Scene Text Detection Through Bayesian Classification and Boundary Growing , 2012, IEEE Transactions on Circuits and Systems for Video Technology.

[21]  Yi Li,et al.  Deformable Convolutional Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[22]  Wei Liu,et al.  SSD: Single Shot MultiBox Detector , 2015, ECCV.

[23]  Yang Wang,et al.  Scene Text Detection and Tracking in Video with Background Cues , 2018, ICMR.

[24]  Chun Yang,et al.  Tracking Based Multi-Orientation Scene Text Detection: A Unified Framework With Dynamic Programming , 2017, IEEE Transactions on Image Processing.

[25]  Xiaogang Wang,et al.  Object Detection from Video Tubelets with Convolutional Neural Networks , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Changsong Liu,et al.  A research on Video text tracking and recognition , 2013, Electronic Imaging.

[27]  Jin Hyung Kim,et al.  Texture-Based Approach for Text Detection in Images Using Support Vector Machines and Continuously Adaptive Mean Shift Algorithm , 2003, IEEE Trans. Pattern Anal. Mach. Intell..

[28]  Palaiahnakote Shivakumara,et al.  A New Technique for Multi-Oriented Scene Text Line Detection and Tracking in Video , 2015, IEEE Transactions on Multimedia.

[29]  Jianbo Shi,et al.  Object Detection in Video with Spatiotemporal Sampling Networks , 2018, ECCV.

[30]  Kai Wang,et al.  Video text detection and recognition: Dataset and benchmark , 2014, IEEE Winter Conference on Applications of Computer Vision.

[31]  Yonatan Wexler,et al.  Detecting text in natural scenes with stroke width transform , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[32]  Shijian Lu,et al.  Text Flow: A Unified Text Detection System in Natural Scene Images , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[33]  Xiang Bai,et al.  Detecting Oriented Text in Natural Images by Linking Segments , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).