TRAT: Tracking by Attention Using Spatio-Temporal Features

Robust object tracking requires knowledge of tracked objects' appearance, motion and their evolution over time. Although motion provides distinctive and complementary information especially for fast moving objects, most of the recent tracking architectures primarily focus on the objects' appearance information. In this paper, we propose a two-stream deep neural network tracker that uses both spatial and temporal features. Our architecture is developed over ATOM tracker and contains two backbones: (i) 2D-CNN network to capture appearance features and (ii) 3D-CNN network to capture motion features. The features returned by the two networks are then fused with attention based Feature Aggregation Module (FAM). Since the whole architecture is unified, it can be trained end-to-end. The experimental results show that the proposed tracker TRAT (TRacking by ATtention) achieves state-of-the-art performance on most of the benchmarks and it significantly outperforms the baseline ATOM tracker.

[1]  Wei Wu,et al.  High Performance Visual Tracking with Siamese Region Proposal Network , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[2]  Gerhard Rigoll,et al.  You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization , 2019, ArXiv.

[3]  Zhenyu He,et al.  The Seventh Visual Object Tracking VOT2019 Challenge Results , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[4]  Xiaogang Wang,et al.  Visual Tracking with Fully Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[5]  Dit-Yan Yeung,et al.  Learning a Deep Compact Image Representation for Visual Tracking , 2013, NIPS.

[6]  Luca Bertinetto,et al.  End-to-End Representation Learning for Correlation Filter Based Tracking , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Cordelia Schmid,et al.  Learning to Track for Spatio-Temporal Action Localization , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[8]  Rui Caseiro,et al.  High-Speed Tracking with Kernelized Correlation Filters , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9]  Luca Bertinetto,et al.  Fully-Convolutional Siamese Networks for Object Tracking , 2016, ECCV Workshops.

[10]  Arnold W. M. Smeulders,et al.  UvA-DARE (Digital Academic Repository) Siamese Instance Search for Tracking , 2016 .

[11]  Wei Wu,et al.  End-to-End Flow Correlation Tracking with Spatial-Temporal Attention , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[12]  Weilin Huang,et al.  Deformable Siamese Attention Networks for Visual Object Tracking , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Wei Wu,et al.  Distractor-aware Siamese Networks for Visual Object Tracking , 2018, ECCV.

[14]  Ming-Hsuan Yang,et al.  Hierarchical Convolutional Features for Visual Tracking , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[15]  Michael Felsberg,et al.  Adaptive Color Attributes for Real-Time Visual Tracking , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Huchuan Lu,et al.  ROI Pooled Correlation Filters for Visual Tracking , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Peng Lu,et al.  Learning Aberrance Repressed Correlation Filters for Real-Time UAV Tracking , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[18]  Qiang Wang,et al.  Fast Online Object Tracking and Segmentation: A Unifying Approach , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Bernard Ghanem,et al.  TrackingNet: A Large-Scale Dataset and Benchmark for Object Tracking in the Wild , 2018, ECCV.

[20]  Rongrong Ji,et al.  Siamese Box Adaptive Network for Visual Tracking , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Hakan Cevikalp,et al.  A Hybrid Method for Tracking of Objects by UAVs , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[22]  Haibin Ling,et al.  Siamese Cascaded Region Proposal Networks for Real-Time Visual Tracking , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Josef Kittler,et al.  Joint Group Feature Selection and Discriminative Filter Learning for Robust Visual Object Tracking , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[24]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[25]  Yi Li,et al.  DeepTrack: Learning Discriminative Feature Representations Online for Robust Visual Tracking , 2015, IEEE Transactions on Image Processing.

[26]  Qiang Wang,et al.  Robust Object Tracking Based on Temporal and Spatial Deep Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[27]  Michael Felsberg,et al.  Beyond Correlation Filters: Learning Continuous Convolution Operators for Visual Tracking , 2016, ECCV.

[28]  Simon Lucey,et al.  Need for Speed: A Benchmark for Higher Frame Rate Object Tracking , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[29]  Bohyung Han,et al.  Learning Multi-domain Convolutional Neural Networks for Visual Tracking , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Michael Felsberg,et al.  Convolutional Features for Correlation Filter Based Visual Tracking , 2015, 2015 IEEE International Conference on Computer Vision Workshop (ICCVW).

[31]  Qiang Wang,et al.  DCFNet: Discriminant Correlation Filters Network for Visual Tracking , 2017, ArXiv.

[32]  Tianzhu Zhang,et al.  Graph Convolutional Tracking , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Michael Felsberg,et al.  Deep motion features for visual tracking , 2016, 2016 23rd International Conference on Pattern Recognition (ICPR).

[34]  Junseok Kwon,et al.  Deep Meta Learning for Real-Time Target-Aware Visual Tracking , 2017, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[35]  Nadjiba Terki,et al.  Hierarchical convolutional features for visual tracking via two combined color spaces with SVM classifier , 2018, Signal, Image and Video Processing.

[36]  Michael Felsberg,et al.  Unveiling the Power of Deep Tracking , 2018, ECCV.

[37]  Seunghoon Hong,et al.  Online Tracking by Learning Discriminative Saliency Map with Convolutional Neural Network , 2015, ICML.

[38]  Hakan Cevikalp,et al.  Visual Object Tracking by Using Ranking Loss , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[39]  Michael Felsberg,et al.  The Visual Object Tracking VOT2017 Challenge Results , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[40]  Michael Felsberg,et al.  The Sixth Visual Object Tracking VOT2018 Challenge Results , 2018, ECCV Workshops.

[41]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[42]  Ming-Hsuan Yang,et al.  Object Tracking Benchmark , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[43]  Jiri Matas,et al.  D3S – A Discriminative Single Shot Segmentation Tracker , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Wei Wu,et al.  SiamRPN++: Evolution of Siamese Visual Tracking With Very Deep Networks , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Bohyung Han,et al.  Real-Time MDNet , 2018, ECCV.

[46]  Michael Felsberg,et al.  ECO: Efficient Convolution Operators for Tracking , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Bernard Ghanem,et al.  A Benchmark and Simulator for UAV Tracking , 2016, ECCV.

[48]  David Zhang,et al.  Fast Visual Tracking via Dense Spatio-temporal Context Learning , 2014, ECCV.

[49]  Michael Felsberg,et al.  ATOM: Accurate Tracking by Overlap Maximization , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Wei Wu,et al.  STM: SpatioTemporal and Motion Encoding for Action Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[51]  Bruce A. Draper,et al.  Visual object tracking using adaptive correlation filters , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[52]  Huchuan Lu,et al.  GradNet: Gradient-Guided Network for Visual Object Tracking , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[53]  Ling Shao,et al.  3C-Net: Category Count and Center Loss for Weakly-Supervised Action Localization , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[54]  Gang Wang,et al.  Joint Learning of Convolutional Neural Networks and Temporally Constrained Metrics for Tracklet Association , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[55]  Song Wang,et al.  Learning Dynamic Siamese Network for Visual Object Tracking , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[56]  A. Aydın Alatan,et al.  Good Features to Correlate for Visual Tracking , 2017, IEEE Transactions on Image Processing.

[57]  Fan Yang,et al.  LaSOT: A High-Quality Benchmark for Large-Scale Single Object Tracking , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[58]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[59]  Luca Bertinetto,et al.  Staple: Complementary Learners for Real-Time Tracking , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[60]  Xin Zhao,et al.  GOT-10k: A Large High-Diversity Benchmark for Generic Object Tracking in the Wild , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[61]  Fahad Shahbaz Khan,et al.  D2Det: Towards High Quality Object Detection and Instance Segmentation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[62]  Yutaka Satoh,et al.  Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[63]  Huchuan Lu,et al.  Visual Tracking via Adaptive Spatially-Regularized Correlation Filters , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[64]  Luc Van Gool,et al.  Learning Discriminative Model Prediction for Tracking , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[65]  Zhiwei Xiong,et al.  SPM-Tracker: Series-Parallel Matching for Real-Time Visual Object Tracking , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[66]  Qilong Wang,et al.  ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[67]  Limin Wang,et al.  LIP: Local Importance-Based Pooling , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[68]  Yuning Jiang,et al.  Acquisition of Localization Confidence for Accurate Object Detection , 2018, ECCV.

[69]  Wei Liu,et al.  Unsupervised Deep Tracking , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[70]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[71]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[72]  Abhinav Gupta,et al.  Transferring Rich Feature Hierarchies for Robust Visual Tracking , 2015, ArXiv.