Beyond Visual Cues: Synchronously Exploring Target-Centric Semantics for Vision-Language Tracking

Single object tracking aims to locate a specific target in a video sequence given its initial state. Classical trackers rely solely on visual cues, which limits their ability to handle challenges such as appearance variations, ambiguity, and distractors. Vision-Language (VL) tracking has therefore emerged as a promising approach that incorporates language descriptions to provide high-level semantics and enhance tracking performance. However, current VL trackers have not fully exploited the power of VL learning: they suffer from heavy reliance on off-the-shelf backbones for feature extraction, ineffective VL fusion designs, and the absence of VL-related loss functions. To address these limitations, we present a novel tracker that progressively explores target-centric semantics for VL tracking. Specifically, we propose the first Synchronous Learning Backbone (SLB) for VL tracking, which consists of two novel modules: the Target Enhance Module (TEM) and the Semantic Aware Module (SAM). These modules enable the tracker to perceive target-related semantics and comprehend the context of the visual and textual modalities at the same pace, facilitating VL feature extraction and fusion at different semantic levels. Moreover, we devise a dense matching loss to further strengthen multi-modal representation learning. Extensive experiments on VL tracking datasets demonstrate the superiority and effectiveness of our method.
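Since the abstract does not spell out implementation details, the following is a minimal sketch, assuming a PyTorch-style design, of how one stage of such a synchronous backbone could interleave a target-enhance step (template cues re-weighting search-region tokens) with a semantic-aware step (cross-attention that updates visual and textual tokens at the same depth). All class names, dimensions, and wiring below are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: module names (TargetEnhance, SemanticAware, SyncStage),
# shapes, and wiring are assumptions made for this example.
import torch
import torch.nn as nn


class TargetEnhance(nn.Module):
    """Hypothetical TEM: modulate search tokens with a pooled target embedding."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, search_tokens, template_tokens):
        # Pool the template (target) tokens into one descriptor, then use it
        # as a channel-wise gate on every search-region token.
        target = template_tokens.mean(dim=1, keepdim=True)   # (B, 1, C)
        return search_tokens * self.gate(target)             # (B, Ns, C)


class SemanticAware(nn.Module):
    """Hypothetical SAM: bidirectional cross-attention between modalities."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.v2t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.t2v = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis_tokens, txt_tokens):
        # Visual tokens query language semantics and vice versa, so both
        # modalities are updated "at the same pace" within one stage.
        vis_out, _ = self.v2t(vis_tokens, txt_tokens, txt_tokens)
        txt_out, _ = self.t2v(txt_tokens, vis_tokens, vis_tokens)
        return vis_tokens + vis_out, txt_tokens + txt_out


class SyncStage(nn.Module):
    """One stage of a synchronous VL backbone built from the two blocks above."""
    def __init__(self, dim: int):
        super().__init__()
        self.tem = TargetEnhance(dim)
        self.sam = SemanticAware(dim)
        self.ffn = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())

    def forward(self, search_tokens, template_tokens, txt_tokens):
        search_tokens = self.tem(search_tokens, template_tokens)
        search_tokens, txt_tokens = self.sam(search_tokens, txt_tokens)
        return self.ffn(search_tokens), template_tokens, txt_tokens


if __name__ == "__main__":
    B, Ns, Nt, Nw, C = 2, 256, 64, 20, 256
    stage = SyncStage(C)
    s, t, w = stage(torch.randn(B, Ns, C), torch.randn(B, Nt, C), torch.randn(B, Nw, C))
    print(s.shape, t.shape, w.shape)
```

Stacking several such stages would let target-related and language-related semantics be injected progressively at different depths, which is the behavior the abstract attributes to the SLB; the dense matching loss mentioned there would be applied on top of these fused features during training.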
