CSART: Channel and spatial attention-guided residual learning for real-time object tracking

Abstract. Siamese networks have achieved great success in visual tracking because they balance precision and speed. However, Siamese trackers usually rely only on the local features of the last layer, which may degrade tracking performance in difficult scenarios. In this paper, we propose a novel Channel and Spatial Attention-guided Residual learning framework for Tracking, referred to as CSART, which improves the feature representation of Siamese networks by exploiting self-attention mechanisms to capture powerful contextual information. Specifically, for efficient and seamless integration, different kinds of self-attention are appended to the template and search branches of the Siamese network, respectively, to model global semantic inter-dependencies in the channel and spatial dimensions. To avoid representation degradation, we adaptively aggregate the basic feature and its attention-weighted features through residual learning. Furthermore, a joint loss consisting of the classic logistic loss and a focal softmax loss is designed to emphasize difficult samples and guide the learning process of the whole model. Benefiting from this scheme, CSART alleviates the over-fitting problem to some extent and enhances discriminability. Extensive experiments on six popular tracking datasets indicate that the proposed tracker outperforms other state-of-the-art trackers while running at 65 fps.
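The attention-guided residual aggregation described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it omits the learned query/key/value projections a trained model would normally use, and the function names and the fixed scalars `alpha` and `beta` (standing in for learnable aggregation weights) are assumptions made for illustration. It shows only the general idea of re-weighting a feature map by channel or spatial affinities and adding the result back to the basic feature.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def channel_attention_residual(feat, alpha=0.5):
    """Channel self-attention with a residual connection.

    feat: array of shape (C, H, W).
    alpha: hypothetical fixed weight (learnable in a real model).
    """
    c, h, w = feat.shape
    flat = feat.reshape(c, h * w)            # (C, N), with N = H * W
    energy = flat @ flat.T                   # (C, C) channel affinities
    attn = softmax(energy, axis=-1)          # each row sums to 1
    weighted = (attn @ flat).reshape(c, h, w)
    # Residual aggregation: the basic feature is kept, attention adds to it.
    return feat + alpha * weighted


def spatial_attention_residual(feat, beta=0.5):
    """Spatial self-attention with a residual connection.

    feat: array of shape (C, H, W).
    beta: hypothetical fixed weight (learnable in a real model).
    """
    c, h, w = feat.shape
    flat = feat.reshape(c, h * w)            # (C, N)
    energy = flat.T @ flat                   # (N, N) position affinities
    attn = softmax(energy, axis=-1)          # row i: weights over positions j
    # Output at position i is the attention-weighted sum over all positions j.
    weighted = (flat @ attn.T).reshape(c, h, w)
    return feat + beta * weighted


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feature_map = rng.standard_normal((8, 6, 6))   # toy feature map
    print(channel_attention_residual(feature_map).shape)
    print(spatial_attention_residual(feature_map).shape)
```

Because of the residual form, setting `alpha` or `beta` to zero returns the basic feature unchanged, which is one way to read the abstract's point about avoiding representation degradation.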
