Capsule-based Object Tracking with Natural Language Specification

Tracking with Natural-Language Specification (TNL) is a joint topic of understanding the vision and natural language with a wide range of applications. In previous works, the communication between two heterogeneous features of vision and language is mainly through a simple dynamic convolution. However, the performance of prior works is capped by the difficulty of linguistic variation of natural language in modeling the dynamically changing target and its surroundings. In the meanwhile, natural language and vision are firstly fused and then utilized for tracking, which is hard to model the query-focused context. Query-focused should pay more attention to context modeling to promote the correlation between these two features. To address these issues, we propose a capsule-based network, referred to as CapsuleTNL, which performs regression tracking with natural language query. In the beginning, the visual and textual input is encoded with capsules, which can not only establish the relationship between entities but also the relationship between the parts of the entity itself. Then, we devise two interaction routing modules, which consist of visual-textual routing module to reduce the linguistic variation of input query and textual-visual routing module to precisely incorporate query-based visual cues simultaneously. To validate the potential of the proposed network for visual object tracking, we evaluate our method on two large tracking benchmarks. The experimental evaluation demonstrates the effectiveness of our capsule-based network.

[1]  Xiaogang Wang,et al.  Visual Tracking with Fully Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[2]  Haibin Ling,et al.  Siamese Cascaded Region Proposal Networks for Real-Time Visual Tracking , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Luc Van Gool,et al.  Probabilistic Regression for Visual Tracking , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[5]  Xiaogang Wang,et al.  STCT: Sequentially Training Convolutional Networks for Visual Tracking , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Mubarak Shah,et al.  VideoCapsuleNet: A Simplified Network for Action Detection , 2018, NeurIPS.

[7]  Chung Choo Chung,et al.  Sequence-to-Sequence Prediction of Vehicle Trajectory via LSTM Encoder-Decoder Architecture , 2018, 2018 IEEE Intelligent Vehicles Symposium (IV).

[8]  Wei Wu,et al.  SiamRPN++: Evolution of Siamese Visual Tracking With Very Deep Networks , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Wei Wu,et al.  High Performance Visual Tracking with Siamese Region Proposal Network , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[10]  Arnold W. M. Smeulders,et al.  Tracking by Natural Language Specification , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Luca Bertinetto,et al.  Fully-Convolutional Siamese Networks for Object Tracking , 2016, ECCV Workshops.

[12]  Mubarak Shah,et al.  Visual-Textual Capsule Routing for Text-Based Video Segmentation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Bohyung Han,et al.  Learning Multi-domain Convolutional Neural Networks for Visual Tracking , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Zhipeng Zhang,et al.  Deeper and Wider Siamese Networks for Real-Time Visual Tracking , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[16]  Josef Kittler,et al.  Joint Group Feature Selection and Discriminative Filter Learning for Robust Visual Object Tracking , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[17]  Luca Bertinetto,et al.  End-to-End Representation Learning for Correlation Filter Based Tracking , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[19]  Jun Cheng,et al.  RiFeGAN: Rich Feature Generation for Text-to-Image Synthesis From Prior Knowledge , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Peng Lu,et al.  Learning Aberrance Repressed Correlation Filters for Real-Time UAV Tracking , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[21]  Bruce A. Draper,et al.  Visual object tracking using adaptive correlation filters , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[22]  Dingwen Zhang,et al.  Employing Deep Part-Object Relationships for Salient Object Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[23]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Huchuan Lu,et al.  GradNet: Gradient-Guided Network for Visual Object Tracking , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[25]  Geoffrey E. Hinton,et al.  Dynamic Routing Between Capsules , 2017, NIPS.

[26]  Qiang Wang,et al.  Fast Online Object Tracking and Segmentation: A Unifying Approach , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Stan Sclaroff,et al.  Robust Visual Object Tracking with Natural Language Region Proposal Network , 2019, ArXiv.

[28]  Mubarak Shah,et al.  CapsuleVOS: Semi-Supervised Video Object Segmentation Using Capsule Routing , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[29]  Chang D. Yoo,et al.  VLANet: Video-Language Alignment Network for Weakly-Supervised Video Moment Retrieval , 2020, ECCV.

[30]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[31]  Rynson W. H. Lau,et al.  CREST: Convolutional Residual Learning for Visual Tracking , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[32]  HeKaiming,et al.  Faster R-CNN , 2017 .

[33]  Ulas Bagci,et al.  Capsules for Object Segmentation , 2018, ArXiv.

[34]  L. Gool,et al.  Learning Discriminative Model Prediction for Tracking , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[35]  Fan Yang,et al.  LaSOT: A High-Quality Benchmark for Large-Scale Single Object Tracking , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Michael Felsberg,et al.  ECO: Efficient Convolution Operators for Tracking , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Andreas Geiger,et al.  Vision meets robotics: The KITTI dataset , 2013, Int. J. Robotics Res..

[38]  Zhe Gan,et al.  TIGEr: Text-to-Image Grounding for Image Caption Evaluation , 2019, EMNLP.

[39]  Weilin Huang,et al.  Deformable Siamese Attention Networks for Visual Object Tracking , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Song Wang,et al.  Learning Dynamic Siamese Network for Visual Object Tracking , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[41]  Michael Felsberg,et al.  ATOM: Accurate Tracking by Overlap Maximization , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Philip H.S. Torr,et al.  Siam R-CNN: Visual Tracking by Re-Detection , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Larry S. Davis,et al.  MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Huchuan Lu,et al.  Dual Deep Network for Visual Tracking , 2016, IEEE Transactions on Image Processing.

[45]  Bohyung Han,et al.  Real-Time MDNet , 2018, ECCV.

[46]  Bingbing Ni,et al.  Deep Regression Tracking with Shrinkage Loss , 2018, ECCV.

[47]  Lexing Xie,et al.  Transform and Tell: Entity-Aware News Image Captioning , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Geoffrey E. Hinton,et al.  Matrix capsules with EM routing , 2018, ICLR.

[49]  Rynson W. H. Lau,et al.  VITAL: VIsual Tracking via Adversarial Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[50]  Rongrong Ji,et al.  Siamese Box Adaptive Network for Visual Tracking , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Amit K. Roy-Chowdhury,et al.  Weakly Supervised Video Moment Retrieval From Text Queries , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).