Video Text Detection with Fully Convolutional Network and Tracking

Scene text in videos carries rich semantic information that is valuable for a variety of content-based video applications. In this paper, we propose an effective fully convolutional network model for detecting text in videos, built on a novel refine block structure. The model hierarchically uses low-level features from earlier convolutional layers to refine high-level semantic features, fusing multi-resolution features extracted from the frame into high-resolution semantic feature maps that better capture the widely varying appearance of video text. We further complement single-frame detection with an efficient correlation-filter-based text tracking mechanism, and improve overall detection performance by matching and combining the detection and tracking results. Experiments on public scene text video datasets demonstrate the state-of-the-art performance of the proposed method.
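
To illustrate the idea of matching and combining detection and tracking results, the following is a minimal sketch rather than the method used in the paper. It assumes axis-aligned boxes given as (x1, y1, x2, y2), a greedy one-to-one matching by intersection over union (IoU), and a hypothetical threshold `iou_thresh`; the function names `iou` and `combine` are illustrative. Matched pairs keep the detector's box, and tracked boxes with no matching detection are retained as text the detector may have missed in the current frame.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed axis-aligned


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def combine(detections: List[Box], tracks: List[Box],
            iou_thresh: float = 0.5) -> List[Box]:
    """Greedily match detections to tracked boxes and merge the two sets.

    Assumption (not from the paper): a detection and a track refer to the
    same text region when their IoU is at least `iou_thresh`.
    """
    # Score every detection/track pair, best overlaps first.
    pairs = sorted(
        ((iou(d, t), i, j)
         for i, d in enumerate(detections)
         for j, t in enumerate(tracks)),
        reverse=True,
    )
    used_det, used_trk = set(), set()
    for score, i, j in pairs:
        if score < iou_thresh:
            break  # remaining pairs overlap too little to be a match
        if i in used_det or j in used_trk:
            continue  # enforce one-to-one matching
        used_det.add(i)
        used_trk.add(j)

    # Keep all detections; add tracked boxes that no detection explains.
    merged = list(detections)
    merged.extend(t for j, t in enumerate(tracks) if j not in used_trk)
    return merged


if __name__ == "__main__":
    dets = [(0.0, 0.0, 10.0, 10.0)]
    trks = [(1.0, 1.0, 11.0, 11.0), (50.0, 50.0, 60.0, 60.0)]
    print(combine(dets, trks))
```

In this sketch the first tracked box overlaps the detection strongly and is absorbed by it, while the second tracked box has no corresponding detection and is carried over into the combined result.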
