Visual Tracking via Dynamic Memory Networks

Template-matching methods for visual tracking have gained popularity recently due to their good performance and fast speed. However, they lack effective ways to adapt to changes in the target object's appearance, making their tracking accuracy still far from state-of-the-art. In this paper, we propose a dynamic memory network to adapt the template to the target's appearance variations during tracking. The reading and writing process of the external memory is controlled by an LSTM network with the search feature map as input. A spatial attention mechanism is applied to concentrate the LSTM input on the potential target as the location of the target is at first unknown. To prevent aggressive model adaptivity, we apply gated residual template learning to control the amount of retrieved memory that is used to combine with the initial template. In order to alleviate the drift problem, we also design a “negative” memory unit that stores templates for distractors, which are used to cancel out wrong responses from the object template. To further boost the tracking performance, an auxiliary classification loss is added after the feature extractor part. Unlike tracking-by-detection methods where the object's information is maintained by the weight parameters of neural networks, which requires expensive online fine-tuning to be adaptable, our tracker runs completely feed-forward and adapts to the target's appearance changes by updating the external memory. Moreover, the capacity of our model is not determined by the network size as with other trackers – the capacity can be easily enlarged as the memory requirements of a task increase, which is favorable for memorizing long-term object information. Extensive experiments on the OTB and VOT datasets demonstrate that our trackers perform favorably against state-of-the-art tracking methods while retaining real-time speed.

[1]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[2]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[3]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[4]  Rui Caseiro,et al.  High-Speed Tracking with Kernelized Correlation Filters , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5]  Simon Lucey,et al.  Learning Policies for Adaptive Tracking with Deep Feature Cascades , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[6]  Chi-Keung Tang,et al.  MAVOT: Memory-Augmented Video Object Tracking , 2017, ArXiv.

[7]  Jiri Matas,et al.  Discriminative Correlation Filter with Channel and Spatial Reliability , 2017, CVPR.

[8]  Bohyung Han,et al.  BranchOut: Regularization for Online Ensemble Tracking with Convolutional Neural Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Song Wang,et al.  Learning Dynamic Siamese Network for Visual Object Tracking , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[11]  Martín Abadi,et al.  TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems , 2016, ArXiv.

[12]  Haibin Ling,et al.  Parallel Tracking and Verifying: A Framework for Real-Time and High Accuracy Visual Tracking , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[13]  Luca Bertinetto,et al.  Fully-Convolutional Siamese Networks for Object Tracking , 2016, ECCV Workshops.

[14]  Horst Bischof,et al.  Semi-supervised On-Line Boosting for Robust Tracking , 2008, ECCV.

[15]  Rob Fergus,et al.  Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-scale Convolutional Architecture , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[16]  Luca Bertinetto,et al.  Staple: Complementary Learners for Real-Time Tracking , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Qingming Huang,et al.  Hedged Deep Tracking , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Cordelia Schmid,et al.  Online Object Tracking with Proposal Selection , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[19]  Ming-Hsuan Yang,et al.  Learning Spatial-Aware Regressions for Visual Tracking , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[20]  Changsheng Xu,et al.  Multi-task Correlation Particle Filter for Robust Object Tracking , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Yong Liu,et al.  Large Margin Object Tracking with Circulant Feature Maps , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Chong Luo,et al.  A Twofold Siamese Network for Real-Time Object Tracking , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[23]  Michael Felsberg,et al.  The Visual Object Tracking VOT2017 Challenge Results , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[24]  Yi Wu,et al.  Online Object Tracking: A Benchmark , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[25]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[26]  Ming-Hsuan Yang,et al.  Hierarchical Convolutional Features for Visual Tracking , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[27]  Silvio Savarese,et al.  Learning to Track at 100 FPS with Deep Regression Networks , 2016, ECCV.

[28]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[29]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Trevor Darrell,et al.  Fully Convolutional Networks for Semantic Segmentation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[31]  Dit-Yan Yeung,et al.  A Convex Formulation for Learning Task Relationships in Multi-Task Learning , 2010, UAI.

[32]  Geoffrey E. Hinton,et al.  Layer Normalization , 2016, ArXiv.

[33]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[34]  Arnold W. M. Smeulders,et al.  UvA-DARE (Digital Academic Repository) Siamese Instance Search for Tracking , 2016 .

[35]  Huchuan Lu,et al.  Deep visual tracking: Review and experimental comparison , 2018, Pattern Recognit..

[36]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[37]  Alex Graves,et al.  Neural Turing Machines , 2014, ArXiv.

[38]  Bohyung Han,et al.  Modeling and Propagating CNNs in a Tree Structure for Visual Tracking , 2016, ArXiv.

[39]  Ming-Hsuan Yang,et al.  Object Tracking Benchmark , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[40]  Michael Felsberg,et al.  Convolutional Features for Correlation Filter Based Visual Tracking , 2015, 2015 IEEE International Conference on Computer Vision Workshop (ICCVW).

[41]  Antoni B. Chan,et al.  Recurrent Filter Learning for Visual Tracking , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[42]  Bohyung Han,et al.  Learning Multi-domain Convolutional Neural Networks for Visual Tracking , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Jason Weston,et al.  End-To-End Memory Networks , 2015, NIPS.

[44]  Huchuan Lu,et al.  Dual Deep Network for Visual Tracking , 2016, IEEE Transactions on Image Processing.

[45]  Michael Felsberg,et al.  Beyond Correlation Filters: Learning Continuous Convolution Operators for Visual Tracking , 2016, ECCV.

[46]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[47]  Seunghoon Hong,et al.  Learning Deconvolution Network for Semantic Segmentation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[48]  Rich Caruana,et al.  Multitask Learning , 1998, Encyclopedia of Machine Learning and Data Mining.

[49]  Luca Bertinetto,et al.  End-to-End Representation Learning for Correlation Filter Based Tracking , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Andrew Zisserman,et al.  Robust Object Tracking , 2001 .

[51]  Zhenyu He,et al.  The Visual Object Tracking VOT2016 Challenge Results , 2016, ECCV Workshops.

[52]  Sergio Gomez Colmenarejo,et al.  Hybrid computing using a neural network with dynamic external memory , 2016, Nature.

[53]  Michael Felsberg,et al.  Learning Spatially Regularized Correlation Filters for Visual Tracking , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[54]  Jason Weston,et al.  Memory Networks , 2014, ICLR.

[55]  Brian Kingsbury,et al.  New types of deep neural network learning for speech recognition and related applications: an overview , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[56]  Daan Wierstra,et al.  One-shot Learning with Memory-Augmented Neural Networks , 2016, ArXiv.

[57]  Wei Wu,et al.  High Performance Visual Tracking with Siamese Region Proposal Network , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[58]  한보형,et al.  Learning Deconvolution Network for Semantic Segmentation , 2015 .

[59]  Roberto Cipolla,et al.  Multi-task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[60]  A. Aydın Alatan,et al.  Good Features to Correlate for Visual Tracking , 2017, IEEE Transactions on Image Processing.

[61]  Rynson W. H. Lau,et al.  CREST: Convolutional Residual Learning for Visual Tracking , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[62]  Antoni B. Chan,et al.  Heterogeneous Multi-task Learning for Human Pose Estimation with Deep Convolutional Neural Network , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[63]  Hongdong Li,et al.  Beyond Local Search: Tracking Objects Everywhere with Instance-Specific Proposals , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[64]  Michael Felsberg,et al.  Accurate Scale Estimation for Robust Visual Tracking , 2014, BMVC.

[65]  Junliang Xing,et al.  Learning Attentions: Residual Attentional Siamese Network for High Performance Online Visual Tracking , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[66]  Antoni B. Chan,et al.  Learning Dynamic Memory Networks for Object Tracking , 2018, ECCV.

[67]  Jason Weston,et al.  A unified architecture for natural language processing: deep neural networks with multitask learning , 2008, ICML '08.

[68]  Michael Felsberg,et al.  Adaptive Decontamination of the Training Set: A Unified Formulation for Discriminative Visual Tracking , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[69]  Michael Felsberg,et al.  The Visual Object Tracking VOT2015 Challenge Results , 2015, 2015 IEEE International Conference on Computer Vision Workshop (ICCVW).

[70]  Zdenek Kalal,et al.  Tracking-Learning-Detection , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[71]  Ming-Hsuan Yang,et al.  Robust Object Tracking with Online Multiple Instance Learning , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[72]  Yiannis Demiris,et al.  Attentional Correlation Filter Network for Adaptive Visual Tracking , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[73]  Michael Felsberg,et al.  ECO: Efficient Convolution Operators for Tracking , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[74]  Sanja Fidler,et al.  Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[75]  Xiaogang Wang,et al.  Visual Tracking with Fully Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[76]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.