Learning Spatial-Frequency Transformer for Visual Object Tracking

Recent trackers adopt the Transformer to complement or replace the widely used ResNet as their backbone network. Although these trackers work well in regular scenarios, they simply flatten the 2D features into a sequence to match the Transformer's input format. We believe this operation ignores the spatial prior of the target object, which may lead to sub-optimal results. In addition, many works demonstrate that self-attention acts as a low-pass filter regardless of the input features or the keys and queries. That is, it may suppress the high-frequency components of the input features while preserving or even amplifying the low-frequency information. To handle these issues, we propose a unified Spatial-Frequency Transformer that models the Gaussian spatial Prior and High-frequency emphasis Attention (GPHA) simultaneously. Specifically, the Gaussian spatial prior is generated by dual Multi-Layer Perceptrons (MLPs) and injected into the similarity matrix produced by multiplying the Query and Key features in self-attention. The result is fed into a Softmax layer and then decomposed into two components: the direct signal and the high-frequency signal. The low-pass and high-pass branches are rescaled and combined to form an all-pass filter, so the high-frequency features are well preserved across stacked self-attention layers. We further integrate the Spatial-Frequency Transformer into the Siamese tracking framework and propose a novel tracking algorithm, termed SFTransT. A Swin Transformer with cross-scale fusion is adopted as the backbone, and a multi-head cross-attention module is used to boost the interaction between search and template features. The output is fed into the tracking head for target localization. Extensive experiments on both short-term and long-term tracking benchmarks demonstrate the effectiveness of the proposed framework.
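The attention step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: the Gaussian prior is computed from a fixed `sigma` over pairwise grid distances (in the paper it is produced by dual MLPs), the prior is injected by addition, the direct component is taken to be the uniform averaging matrix, and the names `gaussian_prior`, `gpha_attention`, `low_scale`, and `high_scale` are hypothetical.

```python
import numpy as np


def gaussian_prior(h, w, sigma=2.0):
    """Pairwise Gaussian weights between the cells of an h x w grid.

    Assumption: sigma is fixed here; the paper predicts the prior with MLPs.
    """
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pos = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)  # (h*w, 2)
    d2 = ((pos[:, None, :] - pos[None, :, :]) ** 2).sum(axis=-1)     # (h*w, h*w)
    return np.exp(-d2 / (2.0 * sigma ** 2))


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gpha_attention(q, k, v, prior, low_scale=1.0, high_scale=1.5):
    """Single-head attention with a spatial prior and high-frequency emphasis."""
    n, d = q.shape
    # Inject the spatial prior into the Query-Key similarity (assumed additive).
    sim = q @ k.T / np.sqrt(d) + prior
    attn = softmax(sim)
    # Split the attention map into a direct (low-pass) part and the remainder.
    direct = np.full((n, n), 1.0 / n)
    high = attn - direct
    # Rescale both branches and recombine them.
    mixed = low_scale * direct + high_scale * high
    return mixed @ v, mixed


rng = np.random.default_rng(0)
h, w, d = 4, 4, 8
tokens = rng.standard_normal((h * w, d))  # flattened 2D feature map
prior = gaussian_prior(h, w)
out, mixed = gpha_attention(tokens, tokens, tokens, prior)
print(out.shape)
```

With `low_scale = high_scale = 1` the recombined matrix equals the ordinary Softmax attention map, so the sketch reduces to standard self-attention with an added prior. Setting `high_scale` above 1 amplifies the part of the attention map that deviates from uniform averaging.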
