ROAM: Recurrently Optimizing Tracking Model

In this paper, we design a tracking model consisting of response generation and bounding box regression, where the first component produces a heat map to indicate the presence of the object at different positions and the second part regresses the relative bounding box shifts to anchors mounted on sliding-window locations. Thanks to the resizable convolutional filters used in both components to adapt to the shape changes of objects, our tracking model does not need to enumerate different sized anchors, thus saving model parameters. To effectively adapt the model to appearance variations, we propose to offline train a recurrent neural optimizer to update tracking model in a meta-learning setting, which can converge the model in a few gradient steps. This improves the convergence speed of updating the tracking model while achieving better performance. We extensively evaluate our trackers, ROAM and ROAM++, on the OTB, VOT, LaSOT, GOT-10K and TrackingNet benchmark and our methods perform favorably against state-of-the-art algorithms.

[1]  Hong Yu,et al.  Meta Networks , 2017, ICML.

[2]  Wei Liu,et al.  Unsupervised Deep Tracking , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Huchuan Lu,et al.  Structured Siamese Network for Real-Time Visual Tracking , 2018, ECCV.

[4]  Bernard Ghanem,et al.  TrackingNet: A Large-Scale Dataset and Benchmark for Object Tracking in the Wild , 2018, ECCV.

[5]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[6]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .

[7]  Wei Wu,et al.  SiamRPN++: Evolution of Siamese Visual Tracking With Very Deep Networks , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Jiri Matas,et al.  Discriminative Correlation Filter with Channel and Spatial Reliability , 2017, CVPR.

[9]  Pieter Abbeel,et al.  Meta Learning Shared Hierarchies , 2017, ICLR.

[10]  Feng Li,et al.  Learning Spatial-Temporal Regularized Correlation Filters for Visual Tracking , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[11]  Rui Caseiro,et al.  High-Speed Tracking with Kernelized Correlation Filters , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[12]  Michael Felsberg,et al.  The Visual Object Tracking VOT2017 Challenge Results , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[13]  Zhipeng Zhang,et al.  Deeper and Wider Siamese Networks for Real-Time Visual Tracking , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[15]  Luca Bertinetto,et al.  Fully-Convolutional Siamese Networks for Object Tracking , 2016, ECCV Workshops.

[16]  Haibin Ling,et al.  Siamese Cascaded Region Proposal Networks for Real-Time Visual Tracking , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Richard J. Mammone,et al.  Meta-neural networks that learn by learning , 1992, [Proceedings 1992] IJCNN International Joint Conference on Neural Networks.

[18]  Ming-Hsuan Yang,et al.  Object Tracking Benchmark , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[19]  Michael Felsberg,et al.  Beyond Correlation Filters: Learning Continuous Convolution Operators for Visual Tracking , 2016, ECCV.

[20]  Yoram Singer,et al.  Adaptive Subgradient Methods for Online Learning and Stochastic Optimization , 2011, J. Mach. Learn. Res..

[21]  Michael Felsberg,et al.  ECO: Efficient Convolution Operators for Tracking , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Bingbing Ni,et al.  Deep Regression Tracking with Shrinkage Loss , 2018, ECCV.

[23]  Marcin Andrychowicz,et al.  Learning to learn by gradient descent by gradient descent , 2016, NIPS.

[24]  Paul Tseng,et al.  An Incremental Gradient(-Projection) Method with Momentum Term and Adaptive Stepsize Rule , 1998, SIAM J. Optim..

[25]  Luca Bertinetto,et al.  End-to-End Representation Learning for Correlation Filter Based Tracking , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Antoni B. Chan,et al.  Learning Dynamic Memory Networks for Object Tracking , 2018, ECCV.

[27]  Wei Wu,et al.  High Performance Visual Tracking with Siamese Region Proposal Network , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[28]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[29]  Zhenyu He,et al.  The Visual Object Tracking VOT2016 Challenge Results , 2016, ECCV Workshops.

[30]  Bohyung Han,et al.  Learning Multi-domain Convolutional Neural Networks for Visual Tracking , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Junseok Kwon,et al.  Deep Meta Learning for Real-Time Target-Aware Visual Tracking , 2017, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[32]  Rynson W. H. Lau,et al.  CREST: Convolutional Residual Learning for Visual Tracking , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[33]  Luca Bertinetto,et al.  Staple: Complementary Learners for Real-Time Tracking , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Yoshua Bengio,et al.  On the Optimization of a Synaptic Learning Rule , 2007 .

[35]  Xin Zhao,et al.  GOT-10k: A Large High-Diversity Benchmark for Generic Object Tracking in the Wild , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[36]  Michael Felsberg,et al.  The Sixth Visual Object Tracking VOT2018 Challenge Results , 2018, ECCV Workshops.

[37]  Matthew D. Zeiler ADADELTA: An Adaptive Learning Rate Method , 2012, ArXiv.

[38]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[39]  Bohyung Han,et al.  Real-Time MDNet , 2018, ECCV.

[40]  Misha Denil,et al.  Learned Optimizers that Scale and Generalize , 2017, ICML.

[41]  Fan Yang,et al.  LaSOT: A High-Quality Benchmark for Large-Scale Single Object Tracking , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Rynson W. H. Lau,et al.  VITAL: VIsual Tracking via Adversarial Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[43]  Alexander C. Berg,et al.  Meta-Tracker: Fast and Robust Online Adaptation for Visual Object Trackers , 2018, ECCV.

[44]  Silvio Savarese,et al.  Learning to Track at 100 FPS with Deep Regression Networks , 2016, ECCV.

[45]  Arnold W. M. Smeulders,et al.  UvA-DARE (Digital Academic Repository) Siamese Instance Search for Tracking , 2016 .

[46]  Wei Wu,et al.  Distractor-aware Siamese Networks for Visual Object Tracking , 2018, ECCV.

[47]  Ming-Hsuan Yang,et al.  Hierarchical Convolutional Features for Visual Tracking , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[48]  John B. Shoven,et al.  I , Edinburgh Medical and Surgical Journal.

[49]  Simon Lucey,et al.  Correlation filters with limited boundaries , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Sergey Levine,et al.  Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks , 2017, ICML.

[51]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[52]  Jian Li,et al.  Learning Gradient Descent: Better Generalization and Longer Horizons , 2017, ICML.

[53]  Antoni B. Chan,et al.  Recurrent Filter Learning for Visual Tracking , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[54]  Hugo Larochelle,et al.  Optimization as a Model for Few-Shot Learning , 2016, ICLR.

[55]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[56]  Antoni B. Chan,et al.  Visual Tracking via Dynamic Memory Networks , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[57]  Xin Pan,et al.  YouTube-BoundingBoxes: A Large High-Precision Human-Annotated Data Set for Object Detection in Video , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[58]  Bohyung Han,et al.  BranchOut: Regularization for Online Ensemble Tracking with Convolutional Neural Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[60]  Song Wang,et al.  Learning Dynamic Siamese Network for Visual Object Tracking , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).