ZoomTrack: Target-aware Non-uniform Resizing for Efficient Visual Tracking

Recently, the transformer has enabled speed-oriented trackers to approach state-of-the-art (SOTA) performance at high speed thanks to a smaller input size or a lighter feature-extraction backbone, though they still lag substantially behind their performance-oriented counterparts. In this paper, we demonstrate that it is possible to narrow or even close this gap while achieving high tracking speed based on the smaller input size. To this end, we non-uniformly resize the cropped image so that the input remains small while regions where the target is more likely to appear are sampled at a higher resolution, and vice versa. This resolves the dilemma between attending to a larger visual field and retaining more raw information about the target despite the smaller input size. Our formulation of the non-uniform resizing can be solved efficiently via quadratic programming (QP) and integrates naturally into most crop-based local trackers. Comprehensive experiments on five challenging datasets with two kinds of transformer trackers, i.e., OSTrack and TransT, demonstrate consistent improvements over both. In particular, applying our method to the speed-oriented version of OSTrack even outperforms its performance-oriented counterpart by 0.6% AUC on TNL2K, while running 50% faster and reducing MACs by over 55%. Code and models are available at https://github.com/Kou-99/ZoomTrack.
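The core operation described above, resizing the search-region crop so the likely target area keeps more pixels while peripheral context is compressed, can be illustrated with a short NumPy sketch. This is a simplified stand-in rather than the paper's method: ZoomTrack derives the resizing grid from a QP, whereas the sketch below allocates output pixels in closed form from a Gaussian importance profile per axis. All function and parameter names (gaussian_importance, axis_mapping, nonuniform_resize, sigma_factor, min_ratio) are hypothetical and not taken from the released code.

```python
# Minimal sketch (NumPy only) of axis-aligned, target-aware non-uniform resizing.
# Assumption: a closed-form importance-weighted allocation replaces the QP used
# in the actual method; names and default values here are illustrative only.
import numpy as np

def gaussian_importance(length, center, extent, sigma_factor=1.0, base=0.3):
    """1D importance profile: high near the prior target interval, low elsewhere."""
    coords = np.arange(length, dtype=np.float32)
    sigma = max(extent * sigma_factor, 1.0)
    return base + np.exp(-0.5 * ((coords - center) / sigma) ** 2)

def axis_mapping(importance, out_size, min_ratio=0.25):
    """Map each output coordinate to a source coordinate along one axis.

    Source pixels with higher importance receive more output pixels (magnified);
    a floor keeps low-importance regions from being squeezed to nothing.
    """
    w = np.maximum(importance, importance.max() * min_ratio)
    w = w / w.sum()                               # fraction of the output per source pixel
    cdf = np.concatenate([[0.0], np.cumsum(w)])   # monotone source -> output map
    out_coords = (np.arange(out_size) + 0.5) / out_size
    # Invert the map: for each output coordinate, find the source coordinate.
    return np.interp(out_coords, cdf, np.arange(len(importance) + 1)) - 0.5

def nonuniform_resize(crop, box, out_hw):
    """crop: HxWxC search region; box: (x, y, w, h) prior target box in crop coordinates."""
    H, W = crop.shape[:2]
    out_h, out_w = out_hw
    x, y, bw, bh = box
    src_x = axis_mapping(gaussian_importance(W, x + bw / 2, bw), out_w)
    src_y = axis_mapping(gaussian_importance(H, y + bh / 2, bh), out_h)
    xi = np.clip(np.round(src_x).astype(int), 0, W - 1)
    yi = np.clip(np.round(src_y).astype(int), 0, H - 1)
    return crop[yi][:, xi]                        # nearest-neighbour remap for brevity

# Example: shrink a 448x448 crop to 256x256 while magnifying the region around the target.
crop = np.random.rand(448, 448, 3).astype(np.float32)
resized = nonuniform_resize(crop, box=(180, 200, 80, 60), out_hw=(256, 256))
print(resized.shape)  # (256, 256, 3)
```

Because the per-axis mapping is monotone, the resizing is axis-aligned and fold-free; replacing the closed-form allocation with a QP over per-interval widths, and inverting the mapping to project predictions back to the original crop, would bring this sketch closer to the formulation described in the abstract.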

[1] Huchuan Lu, et al. Exploring Lightweight Hierarchical Vision Transformers for Efficient Visual Tracking. ICCV, 2023.

[2] D. Ramanan, et al. Learning to Zoom and Unzoom. CVPR, 2023.

[3] Zeming Li, et al. A Closer Look at Self-supervised Lightweight Vision Transformers. ICML, 2022.

[4] F. Porikli, et al. SALISA: Saliency-based Input Sampling for Efficient Video Object Detection. ECCV, 2022.

[5] S. Shan, et al. Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework. ECCV, 2022.

[6] Limin Wang, et al. MixFormer: End-to-End Tracking with Iterative Mixed Attention. CVPR, 2022.

[7] L. Gool, et al. Transforming Model Prediction for Tracking. CVPR, 2022.

[8] Wanli Ouyang, et al. Backbone is All Your Need: A Simplified Architecture for Visual Object Tracking. ECCV, 2022.

[9] Liping Jing, et al. Learning Target-aware Representation for Visual Tracking via Informative Interactions. IJCAI, 2022.

[10] L. Gool, et al. Efficient Visual Tracking with Exemplar Transformers. WACV, 2023.

[11] T. Martyniuk, et al. FEAR: Fast, Efficient, Accurate and Robust Visual Tracker. ECCV, 2022.

[12] Haibin Ling, et al. SwinTrack: A Simple and Strong Baseline for Transformer Tracking. NeurIPS, 2022.

[13] Ross B. Girshick, et al. Masked Autoencoders Are Scalable Vision Learners. CVPR, 2022.

[14] Deva Ramanan, et al. FOVEA: Foveated Image Magnification for Autonomous Navigation. ICCV, 2021.

[15] Yihao Liu, et al. Learn to Match: Automatic Matching Network Design for Visual Tracking. ICCV, 2021.

[16] Jianlong Fu, et al. LightTrack: Finding Lightweight Neural Networks for Object Tracking via One-Shot Architecture Search. CVPR, 2021.

[17] Matthijs Douze, et al. LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference. ICCV, 2021.

[18] Yonghong Tian, et al. Towards More Flexible and Accurate Object Tracking with Natural Language: Algorithms and Benchmark. CVPR, 2021.

[19] Jianlong Fu, et al. Learning Spatio-Temporal Transformer for Visual Tracking. ICCV, 2021.

[20] Huchuan Lu, et al. Transformer Tracking. CVPR, 2021.

[21] Lin Yuan, et al. LaSOT: A High-quality Large-scale Single Object Tracking Benchmark. International Journal of Computer Vision, 2020.

[22] F. Khan, et al. Distilled Siamese Networks for Visual Tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.

[23] Li Zhaoping, et al. A new framework for understanding vision from the perspective of the primary visual cortex. Current Opinion in Neurobiology, 2019.

[24] L. Gool, et al. Learning Discriminative Model Prediction for Tracking. ICCV, 2019.

[25] Jiebo Luo, et al. Looking for the Devil in the Details: Learning Trilinear Attention Sampling Network for Fine-Grained Image Recognition. CVPR, 2019.

[26] Zhipeng Zhang, et al. Deeper and Wider Siamese Networks for Real-Time Visual Tracking. CVPR, 2019.

[27] Wei Wu, et al. SiamRPN++: Evolution of Siamese Visual Tracking With Very Deep Networks. CVPR, 2019.

[28] Kaiqi Huang, et al. GOT-10k: A Large High-Diversity Benchmark for Generic Object Tracking in the Wild. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[29] Fan Yang, et al. LaSOT: A High-Quality Benchmark for Large-Scale Single Object Tracking. CVPR, 2019.

[30] Wojciech Matusik, et al. Learning to Zoom: a Saliency-Based Sampling Layer for Neural Networks. ECCV, 2018.

[31] Bernard Ghanem, et al. TrackingNet: A Large-Scale Dataset and Benchmark for Object Tracking in the Wild. ECCV, 2018.

[32] Bernard Ghanem, et al. A Benchmark and Simulator for UAV Tracking. ECCV, 2016.

[33] Luca Bertinetto, et al. Fully-Convolutional Siamese Networks for Object Tracking. ECCV Workshops, 2016.

[34] Olga Sorkine-Hornung, et al. Robust Image Retargeting via Axis-Aligned Deformation. Computer Graphics Forum, 2012.

[35] Ariel Shamir, et al. Seam Carving for Content-Aware Image Resizing. ACM Transactions on Graphics, 2007.