Towards More Flexible and Accurate Object Tracking with Natural Language: Algorithms and Benchmark

Tracking by natural language specification is a new rising research topic that aims at locating the target object in the video sequence based on its language description. Compared with traditional bounding box (BBox) based tracking, this setting guides object tracking with high-level semantic information, addresses the ambiguity of BBox, and links local and global search organically together. Those benefits may bring more flexible, robust and accurate tracking performance in practical scenarios. However, existing natural language initialized trackers are developed and compared on benchmark datasets proposed for tracking-by-BBox, which can’t reflect the true power of tracking-by-language. In this work, we propose a new benchmark specifically dedicated to the tracking-by-language, including a large scale dataset, strong and diverse baseline methods. Specifically, we collect 2k video sequences (contains a total of 1,244,340 frames, 663 words) and split 1300/700 for the train/testing respectively. We densely annotate one sentence in English and corresponding bounding boxes of the target object for each video. We also introduce two new challenges into TNL2K for the object tracking task, i.e., adversarial samples and modality switch. A strong baseline method based on an adaptive local-global-search scheme is proposed for future works to compare. We believe this benchmark will greatly boost related researches on natural language guided tracking.

[1]  Michael Felsberg,et al.  ATOM: Accurate Tracking by Overlap Maximization , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Philip H.S. Torr,et al.  Siam R-CNN: Visual Tracking by Re-Detection , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Dit-Yan Yeung,et al.  Visual Object Tracking for Unmanned Aerial Vehicles: A Benchmark and New Motion Models , 2017, AAAI.

[4]  Jiebo Luo,et al.  Grounding-Tracking-Integration , 2019, ArXiv.

[5]  Fan Yang,et al.  LaSOT: A High-Quality Benchmark for Large-Scale Single Object Tracking , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Xin Zhao,et al.  GOT-10k: A Large High-Diversity Benchmark for Generic Object Tracking in the Wild , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[7]  Jin Young Choi,et al.  Action-Decision Networks for Visual Tracking with Deep Reinforcement Learning , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Zhipeng Zhang,et al.  Ocean: Object-aware Anchor-free Tracking , 2020, ECCV.

[9]  Bernard Ghanem,et al.  TrackingNet: A Large-Scale Dataset and Benchmark for Object Tracking in the Wild , 2018, ECCV.

[10]  Yoram Singer,et al.  Adaptive Subgradient Methods for Online Learning and Stochastic Optimization , 2011, J. Mach. Learn. Res..

[11]  Xin Pan,et al.  YouTube-BoundingBoxes: A Large High-Precision Human-Annotated Data Set for Object Detection in Video , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Jian-Huang Lai,et al.  RGB-Infrared Cross-Modality Person Re-identification , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[13]  Luc Van Gool,et al.  Know Your Surroundings: Exploiting Scene Information for Object Tracking , 2020, ECCV.

[14]  Bin Luo,et al.  Learning Target-aware Attention for Robust Tracking with Conditional Adversarial Network , 2019, BMVC.

[15]  Vibhav Vineet,et al.  Struck: Structured Output Tracking with Kernels , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[16]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[17]  Wei Wu,et al.  SiamRPN++: Evolution of Siamese Visual Tracking With Very Deep Networks , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Arnold W. M. Smeulders,et al.  Tracking by Natural Language Specification , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Zuoxin Li,et al.  SiamFC++: Towards Robust and Accurate Visual Tracking with Target Estimation Guidelines , 2020, AAAI.

[20]  Luca Bertinetto,et al.  Fully-Convolutional Siamese Networks for Object Tracking , 2016, ECCV Workshops.

[21]  Bohyung Han,et al.  Learning Multi-domain Convolutional Neural Networks for Visual Tracking , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Huchuan Lu,et al.  GradNet: Gradient-Guided Network for Visual Object Tracking , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[23]  Xiaochun Cao,et al.  Efficient Adversarial Attacks for Visual Object Tracking , 2020, ECCV.

[24]  Qiang Wang,et al.  Fast Online Object Tracking and Segmentation: A Unifying Approach , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Silvio Savarese,et al.  Learning to Track at 100 FPS with Deep Regression Networks , 2016, ECCV.

[26]  Bin Yan,et al.  Cooling-Shrinking Attack: Blinding the Tracker With Imperceptible Noises , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Stan Sclaroff,et al.  Robust Visual Object Tracking with Natural Language Region Proposal Network , 2019, ArXiv.

[28]  Erik Blasch,et al.  Encoding color information for visual tracking: Algorithms and benchmark , 2015, IEEE Transactions on Image Processing.

[29]  Simon Lucey,et al.  Need for Speed: A Benchmark for Higher Frame Rate Object Tracking , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[30]  Zhongfei Zhang,et al.  A survey of appearance models in visual object tracking , 2013, ACM Trans. Intell. Syst. Technol..

[31]  Anqi Xu,et al.  Physical Adversarial Textures That Fool Visual Object Tracking , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[32]  Michael Felsberg,et al.  Learning Spatially Regularized Correlation Filters for Visual Tracking , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[33]  Yonghong Tian,et al.  Dynamic Attention Guided Multi-Trajectory Analysis for Single Object Tracking , 2021, IEEE Transactions on Circuits and Systems for Video Technology.

[34]  Huchuan Lu,et al.  Deep visual tracking: Review and experimental comparison , 2018, Pattern Recognit..

[35]  Shaogang Gong,et al.  RGB-IR Person Re-identification by Cross-Modality Similarity Preservation , 2020, International Journal of Computer Vision.

[36]  Xiaokang Yang,et al.  Robust Tracking against Adversarial Attacks , 2020, ECCV.

[37]  Qiang Li,et al.  Siamese Keypoint Prediction Network for Visual Object Tracking , 2020, ArXiv.

[38]  知秀 柴田 5分で分かる!? 有名論文ナナメ読み:Jacob Devlin et al. : BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding , 2020 .

[39]  Jiri Matas,et al.  D3S – A Discriminative Single Shot Segmentation Tracker , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  L. Gool,et al.  Learning Discriminative Model Prediction for Tracking , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[41]  Jiri Matas,et al.  Discriminative Correlation Filter with Channel and Spatial Reliability , 2017, CVPR.

[42]  Bohyung Han,et al.  BranchOut: Regularization for Online Ensemble Tracking with Convolutional Neural Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Wei Liu,et al.  Unsupervised Deep Tracking , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Jiri Matas,et al.  A Novel Performance Evaluation Methodology for Single-Target Trackers , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[45]  Rynson W. H. Lau,et al.  VITAL: VIsual Tracking via Adversarial Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[46]  Jiebo Luo,et al.  A Fast and Accurate One-Stage Approach to Visual Grounding , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[47]  Simone Calderara,et al.  Visual Tracking: An Experimental Survey , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[48]  Ali Farhadi,et al.  YOLOv3: An Incremental Improvement , 2018, ArXiv.

[49]  Zhenyu He,et al.  PTB-TIR: A Thermal Infrared Pedestrian Tracking Benchmark , 2018, IEEE Transactions on Multimedia.

[50]  Dong Wang,et al.  High-Performance Long-Term Tracking With Meta-Updater , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Limin Wang,et al.  Fully Convolutional Online Tracking , 2020, Comput. Vis. Image Underst..

[52]  Michael Felsberg,et al.  Beyond Correlation Filters: Learning Continuous Convolution Operators for Visual Tracking , 2016, ECCV.

[53]  Bohyung Han,et al.  Real-Time MDNet , 2018, ECCV.

[54]  Antoni B. Chan,et al.  Learning Dynamic Memory Networks for Object Tracking , 2018, ECCV.

[55]  Ling Shao,et al.  CLNet: A Compact Latent Network for Fast Adjusting Siamese Trackers , 2020, ECCV.

[56]  Fahad Shahbaz Khan,et al.  Learning the Model Update for Siamese Trackers , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[57]  Wei Wu,et al.  High Performance Visual Tracking with Siamese Region Proposal Network , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[58]  Bernard Ghanem,et al.  A Benchmark and Simulator for UAV Tracking , 2016, ECCV.

[59]  Junseok Kwon,et al.  Visual Tracking by TridentAlign and Context Embedding , 2020, ACCV.

[60]  Zhipeng Zhang,et al.  Deeper and Wider Siamese Networks for Real-Time Visual Tracking , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[61]  Jin Tang,et al.  RGB-T Object Tracking: Benchmark and Baseline , 2018, Pattern Recognit..

[62]  Michael Felsberg,et al.  ECO: Efficient Convolution Operators for Tracking , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[63]  Arnold W. M. Smeulders,et al.  Long-term Tracking in the Wild: A Benchmark , 2018, ECCV.

[64]  Xin Zhao,et al.  GlobalTrack: A Simple and Strong Baseline for Long-term Tracking , 2019, AAAI.

[65]  Alexander C. Berg,et al.  Meta-Tracker: Fast and Robust Online Adaptation for Visual Object Trackers , 2018, ECCV.

[66]  Yoshua Bengio,et al.  Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling , 2014, ArXiv.

[67]  Luc Van Gool,et al.  Probabilistic Regression for Visual Tracking , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[68]  Luca Bertinetto,et al.  Staple: Complementary Learners for Real-Time Tracking , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[69]  Wei Wu,et al.  Distractor-aware Siamese Networks for Visual Object Tracking , 2018, ECCV.

[70]  Pengfei Xu,et al.  ROAM: Recurrently Optimizing Tracking Model , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[71]  Huchuan Lu,et al.  ‘Skimming-Perusal’ Tracking: A Framework for Real-Time and Robust Long-Term Tracking , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[72]  Ming-Hsuan Yang,et al.  Object Tracking Benchmark , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[73]  Xiao Wang,et al.  Describe and Attend to Track: Learning Natural Language guided Structural Representation and Visual Attention for Object Tracking , 2018, ArXiv.

[74]  Rongrong Ji,et al.  Siamese Box Adaptive Network for Visual Tracking , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[75]  Rui Caseiro,et al.  Ieee Transactions on Pattern Analysis and Machine Intelligence High-speed Tracking with Kernelized Correlation Filters , 2022 .

[76]  Ming-Hsuan Yang,et al.  Visual tracking with online Multiple Instance Learning , 2009, CVPR.

[77]  Shuicheng Yan,et al.  NUS-PRO: A New Visual Tracking Challenge , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[78]  Ying Cui,et al.  SiamCAR: Siamese Fully Convolutional Classification and Regression for Visual Tracking , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[79]  Xiao Wang,et al.  SINT++: Robust Visual Tracking via Adversarial Positive Instance Generation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[80]  M. Shah,et al.  Object tracking: A survey , 2006, CSUR.

[81]  Zhiwei Xiong,et al.  Tracking by Instance Detection: A Meta-Learning Approach , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).