TDIOT: Target-Driven Inference for Deep Video Object Tracking

Recent tracking-by-detection approaches use deep object detectors as the target detection baseline because of their high performance on still images. For effective video object tracking, object detection is integrated with a data association step performed by either a custom-designed inference architecture or end-to-end joint training for tracking. In this work, we adopt the former approach and use the pre-trained Mask R-CNN deep object detector as the baseline. We introduce a novel inference architecture placed on top of the FPN-ResNet101 backbone of Mask R-CNN to jointly perform detection and tracking, without requiring additional training for tracking. The proposed single object tracker, TDIOT, applies appearance similarity-based temporal matching for data association. To tackle tracking discontinuities, we incorporate a local search and matching module that exploits SiamFC into the inference head layer. Moreover, to improve robustness to scale changes, we introduce a scale-adaptive region proposal network that searches for the target in an adaptively enlarged spatial neighborhood specified by the trace of the target. To meet long-term tracking requirements, a low-cost verification layer is incorporated into the inference architecture to monitor the presence of the target based on its local binary pattern (LBP) histogram model. Performance evaluation on videos from the VOT2016, VOT2018, and VOT-LT2018 datasets demonstrates that TDIOT achieves higher accuracy than state-of-the-art short-term trackers while providing comparable performance in long-term tracking. We also evaluate our tracker on the LaSOT dataset, where TDIOT performs comparably to other methods that are trained on LaSOT. The source code and TDIOT output videos are accessible at https://github.com/msprITU/TDIOT.
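
As a rough illustration of the LBP-histogram verification idea, the sketch below builds a basic 8-neighbour LBP histogram for a grayscale patch and compares it with a stored target model. This is not the TDIOT implementation: the function names, the Bhattacharyya coefficient as the similarity measure, and the 0.8 threshold are all assumptions made for the example, since the abstract does not specify them.

```python
import numpy as np

def lbp_histogram(gray):
    """Normalized 256-bin histogram of basic 8-neighbour LBP codes."""
    g = np.asarray(gray, dtype=np.float64)
    h, w = g.shape
    center = g[1:-1, 1:-1]
    codes = np.zeros(center.shape, dtype=np.uint8)
    # Eight neighbours, clockwise from top-left; each contributes one bit.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = g[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        mask = (neighbour >= center).astype(np.uint8)
        codes |= np.left_shift(mask, np.uint8(bit))
    hist = np.bincount(codes.ravel(), minlength=256).astype(np.float64)
    return hist / max(hist.sum(), 1.0)

def histogram_similarity(p, q):
    """Bhattacharyya coefficient in [0, 1]; an assumed similarity measure."""
    return float(np.sum(np.sqrt(p * q)))

def is_target_present(model_hist, candidate_hist, threshold=0.8):
    """Hypothetical presence check; the threshold is an illustrative value."""
    return histogram_similarity(model_hist, candidate_hist) >= threshold
```

In use, the model histogram would be computed once from the initial target patch and compared against each candidate patch; a low similarity would flag the target as absent, which is the role the abstract assigns to the verification layer.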
