MSAT: Multi-scale Sparse Attention Target Tracking

Recent performance gains in object tracking have largely been driven by the rapid growth of model parameters; the Transformer architecture, with its global feature modeling and modality fusion capabilities, is a typical example. However, the full attention mechanism used in this architecture has quadratic complexity in the input sequence length, making it impractical for many real-world engineering scenarios. In this paper, we propose MSAT, a tracker based on a multi-scale sparse Transformer. First, a multi-scale feature interaction backbone performs multi-stage feature fusion to improve robustness against interference and to alleviate the accumulation of shallow-feature interference during backbone feature extraction. Second, we design a feature fusion module that combines a deformable attention mechanism with a bilateral sparse attention mechanism, reducing the Transformer's memory consumption, sharpening the focus on target features, and suppressing background interference. Third, we propose a multi-stage template feature update strategy that ensures reliable template information is maintained during tracking. Extensive experiments show that our tracker surpasses state-of-the-art trackers on GOT-10k, OTB2015, TrackingNet, and LaSOT, while running stable real-time tracking at about 45 FPS.
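To make the motivation concrete: full attention scores every query against every key, which is quadratic in sequence length, while sparse variants let each query attend to only a small subset of keys. The sketch below is a minimal, hypothetical top-k sparsification in NumPy; it is only an illustration of the general idea, not the actual bilateral sparse or deformable attention used in MSAT.

```python
import numpy as np

def topk_sparse_attention(q, k, v, topk=4):
    """Toy sparse attention: each query attends only to its top-k
    highest-scoring keys instead of all keys.

    Illustrative sketch only; MSAT's bilateral sparse and deformable
    attention mechanisms are more involved than this.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                       # (Lq, Lk) dense scores
    # threshold at the top-k-th score per query, mask the rest to -inf
    kth = np.partition(scores, -topk, axis=-1)[:, -topk][:, None]
    masked = np.where(scores >= kth, scores, -np.inf)
    # softmax over the surviving (at most top-k) entries
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                                  # (Lq, d)
```

In a practical implementation the masked scores would never be materialized densely; gathering only the selected keys is what yields the memory savings the abstract refers to.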
