Video Text Tracking With a Spatio-Temporal Complementary Model

Text tracking is to track multiple texts in a video, and construct a trajectory for each text. Existing methods tackle this task by utilizing the tracking-by-detection framework, i.e., detecting the text instances in each frame and associating the corresponding text instances in consecutive frames. We argue that the tracking accuracy of this paradigm is severely limited in more complex scenarios, e.g., owing to motion blur, etc., the missed detection of text instances causes the break of the text trajectory. In addition, different text instances with similar appearance are easily confused, leading to the incorrect association of the text instances. To this end, a novel spatio-temporal complementary text tracking model is proposed in this paper. We leverage a Siamese Complementary Module to fully exploit the continuity characteristic of the text instances in the temporal dimension, which effectively alleviates the missed detection of the text instances, and hence ensures the completeness of each text trajectory. We further integrate the semantic cues and the visual cues of the text instance into a unified representation via a text similarity learning network, which supplies a high discriminative power in the presence of text instances with similar appearance, and thus avoids the mis-association between them. Our method achieves state-of-the-art performance on several public benchmarks. The source code is available at https://github.com/lsabrinax/VideoTextSCM.

[1]  Mohamed H. Abdelpakey,et al.  DP-Siam: Dynamic Policy Siamese Network for Robust Object Tracking , 2020, IEEE Transactions on Image Processing.

[2]  Qiang Wang,et al.  Fast Online Object Tracking and Segmentation: A Unifying Approach , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Wei Wu,et al.  High Performance Visual Tracking with Siamese Region Proposal Network , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[4]  Anlong Ming,et al.  Boundary-induced and scene-aggregated network for monocular depth prediction , 2021, Pattern Recognit..

[5]  Gregory R. Koch,et al.  Siamese Neural Networks for One-Shot Image Recognition , 2015 .

[6]  C. V. Jawahar,et al.  RoadText-1K: Text Detection & Recognition Dataset for Driving Videos , 2020, 2020 IEEE International Conference on Robotics and Automation (ICRA).

[7]  Xiang Bai,et al.  An End-to-End Trainable Neural Network for Image-Based Sequence Recognition and Its Application to Scene Text Recognition , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8]  Chun Yang,et al.  Tracking Based Multi-Orientation Scene Text Detection: A Unified Framework With Dynamic Programming , 2017, IEEE Transactions on Image Processing.

[9]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[10]  Shijian Lu,et al.  Multioriented Video Scene Text Detection Through Bayesian Classification and Boundary Growing , 2012, IEEE Transactions on Circuits and Systems for Video Technology.

[11]  Xu-Cheng Yin,et al.  Multi-strategy tracking based text detection in scene videos , 2015, 2015 13th International Conference on Document Analysis and Recognition (ICDAR).

[12]  Wei Liu,et al.  SSD: Single Shot MultiBox Detector , 2015, ECCV.

[13]  Yu Zhou,et al.  Tiny Obstacle Discovery by Occlusion-Aware Multilayer Regression , 2020, IEEE Transactions on Image Processing.

[14]  Xiangyang Xue,et al.  Arbitrary-Oriented Scene Text Detection via Rotation Proposals , 2017, IEEE Transactions on Multimedia.

[15]  Xiang Li,et al.  Shape Robust Text Detection With Progressive Scale Expansion Network , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Xu-Cheng Yin,et al.  Text Detection, Tracking and Recognition in Video: A Comprehensive Survey , 2016, IEEE Transactions on Image Processing.

[17]  Shiliang Pu,et al.  FREE: A Fast and Robust End-to-End Video Text Spotter , 2020, IEEE Transactions on Image Processing.

[18]  Shuicheng Yan,et al.  Multi-oriented Scene Text Detection via Corner Localization and Region Segmentation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[19]  Anlong Ming,et al.  Occlusion-Shared and Feature-Separated Network for Occlusion Relationship Reasoning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[20]  Yu Zhou,et al.  Online Multiple targets Detection and Tracking from Mobile robot in Cluttered indoor Environments with Depth Camera , 2014, Int. J. Pattern Recognit. Artif. Intell..

[21]  Chun Yang,et al.  Scene Video Text Tracking With Graph Matching , 2018, IEEE Access.

[22]  Kaizhu Huang,et al.  Robust Text Detection in Natural Scene Images , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[23]  Zhaoquan Cai,et al.  Deep Constrained Siamese Hash Coding Network and Load-Balanced Locality-Sensitive Hashing for Near Duplicate Image Detection , 2018, IEEE Transactions on Image Processing.

[24]  Rainer Stiefelhagen,et al.  Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics , 2008, EURASIP J. Image Video Process..

[25]  Ali Farhadi,et al.  Semantic Highlight Retrieval and Term Prediction , 2017, IEEE Transactions on Image Processing.

[26]  Palaiahnakote Shivakumara,et al.  A new Histogram Oriented Moments descriptor for multi-oriented moving text detection in video , 2015, Expert Syst. Appl..

[27]  Weilin Huang,et al.  Text Localization in Natural Images Using Stroke Feature Transform and Text Covariance Descriptors , 2013, 2013 IEEE International Conference on Computer Vision.

[28]  Xiang Bai,et al.  Detecting Oriented Text in Natural Images by Linking Segments , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Xiang Bai,et al.  TextBoxes++: A Single-Shot Oriented Scene Text Detector , 2018, IEEE Transactions on Image Processing.

[30]  Yu Zhou,et al.  Similarity Fusion for Visual Tracking , 2015, International Journal of Computer Vision.

[31]  Shuchang Zhou,et al.  EAST: An Efficient and Accurate Scene Text Detector , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Jianbing Shen,et al.  Local Semantic Siamese Networks for Fast Tracking , 2019, IEEE Transactions on Image Processing.

[33]  Errui Ding,et al.  An End-to-End Video Text Detector with Online Tracking , 2019, 2019 International Conference on Document Analysis and Recognition (ICDAR).

[34]  Andrew Y. Ng,et al.  Text Detection and Character Recognition in Scene Images with Unsupervised Feature Learning , 2011, 2011 International Conference on Document Analysis and Recognition.

[35]  Anlong Ming,et al.  A Novel Multi-layer Framework for Tiny Obstacle Discovery , 2019, 2019 International Conference on Robotics and Automation (ICRA).

[36]  Trevor Darrell,et al.  Fully Convolutional Networks for Semantic Segmentation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[37]  G LoweDavid,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[38]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[39]  David S. Doermann,et al.  Automatic text detection and tracking in digital video , 2000, IEEE Trans. Image Process..

[40]  Jorge Stolfi,et al.  Snoopertrack: Text detection and tracking for outdoor videos , 2011, 2011 18th IEEE International Conference on Image Processing.

[41]  Wenyu Liu,et al.  TextBoxes: A Fast Text Detector with a Single Deep Neural Network , 2016, AAAI.

[42]  Makoto Tanaka,et al.  Autonomous Text Capturing Robot Using Improved DCT Feature and Text Tracking , 2007, Ninth International Conference on Document Analysis and Recognition (ICDAR 2007).

[43]  Yang Wang,et al.  Scene Text Detection and Tracking in Video with Background Cues , 2018, ICMR.

[44]  Palaiahnakote Shivakumara,et al.  Multi-Spectral Fusion Based Approach for Arbitrarily Oriented Scene Text Detection in Video Images , 2015, IEEE Transactions on Image Processing.

[45]  Matthew Turk,et al.  TranslatAR: A mobile augmented reality translator , 2011, 2011 IEEE Workshop on Applications of Computer Vision (WACV).

[46]  Francesco Solera,et al.  Performance Measures and a Data Set for Multi-target, Multi-camera Tracking , 2016, ECCV Workshops.

[47]  Kai Wang,et al.  End-to-end scene text recognition , 2011, 2011 International Conference on Computer Vision.

[48]  Fei Yin,et al.  Deep Direct Regression for Multi-oriented Scene Text Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[49]  Yuxiao Hu,et al.  Text From Corners: A Novel Approach to Detect Text and Caption in Videos , 2011, IEEE Transactions on Image Processing.

[50]  Di Wen,et al.  An Effective Video Text Tracking Algorithm Based on SIFT Feature and Geometric Constraint , 2010, PCM.

[51]  Volker Eiselein,et al.  High-Speed tracking-by-detection without using image information , 2017, 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS).

[52]  Kai Chen,et al.  Real-time Scene Text Detection with Differentiable Binarization , 2019, AAAI.

[53]  Xin Zhao,et al.  TANet: Robust 3D Object Detection from Point Clouds with Triple Attention , 2019, AAAI.

[54]  Ernest Valveny,et al.  ICDAR 2015 competition on Robust Reading , 2015, 2015 13th International Conference on Document Analysis and Recognition (ICDAR).

[55]  Luca Bertinetto,et al.  Fully-Convolutional Siamese Networks for Object Tracking , 2016, ECCV Workshops.

[56]  Ming Yang,et al.  DeepFace: Closing the Gap to Human-Level Performance in Face Verification , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[57]  Jiri Matas,et al.  A Method for Text Localization and Recognition in Real-World Images , 2010, ACCV.

[58]  Lucas Beyer,et al.  In Defense of the Triplet Loss for Person Re-Identification , 2017, ArXiv.

[59]  Xinlei Chen,et al.  Exploring Simple Siamese Representation Learning , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[60]  Christof Koch,et al.  AdaBoost for Text Detection in Natural Scene , 2011, 2011 International Conference on Document Analysis and Recognition.

[61]  Yonatan Wexler,et al.  Detecting text in natural scenes with stroke width transform , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[62]  Yu Zhou,et al.  Object-Level Proposals , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).