Visual Object Tracking by Hierarchical Attention Siamese Network

Visual tracking addresses the problem of localizing an arbitrary target in a video, given its annotated bounding box. In this article, we present a novel tracking method that introduces an attention mechanism into the Siamese network to make its matching more discriminative. We propose a new way to compute attention weights with a sub-Siamese network [Attention Net (A-Net)], which locates the attentive parts of the target to improve matching during search. In addition, features in higher layers preserve more semantic information, while features in lower layers preserve more location information. To address the tracking failures that arise when only higher-layer features are used, we exploit both location and semantic information through multilevel features and propose a new way to fuse the multiscale response maps from each layer, yielding a more accurate estimate of the object position. We then combine the attention weights and the multilayer integration into a hierarchical attention Siamese network for tracking. Our method is implemented with a pretrained network and can outperform most well-trained Siamese trackers even without any fine-tuning or online updating. Comparisons with state-of-the-art methods on popular tracking benchmarks show that our method achieves better performance. Our source code and results will be available at https://github.com/shenjianbing/HASN.
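The pipeline described above (attention-weighted template matching per layer, followed by fusion of the per-layer response maps) can be illustrated with a small NumPy sketch. It is not the authors' implementation: the abstract does not specify how A-Net computes its weights or how the response maps are fused, so the sketch assumes a spatial attention map supplied by the caller, nearest-neighbour resizing, and a weighted sum across layers. All function names (`cross_correlate`, `resize_nearest`, `fuse_responses`, `locate`) are hypothetical.

```python
import numpy as np


def cross_correlate(template, search):
    """Slide a (C, h, w) template over a (C, H, W) search feature map.

    Returns an (H - h + 1, W - w + 1) response map (sum over channels).
    """
    _, h, w = template.shape
    _, H, W = search.shape
    response = np.empty((H - h + 1, W - w + 1))
    for i in range(response.shape[0]):
        for j in range(response.shape[1]):
            response[i, j] = np.sum(template * search[:, i:i + h, j:j + w])
    return response


def resize_nearest(response, size):
    """Nearest-neighbour resize of a 2-D response map to `size` (rows, cols)."""
    rows = np.arange(size[0]) * response.shape[0] // size[0]
    cols = np.arange(size[1]) * response.shape[1] // size[1]
    return response[np.ix_(rows, cols)]


def fuse_responses(responses, layer_weights):
    """Assumed fusion rule: weighted sum at the finest response resolution."""
    size = max(responses, key=lambda r: r.size).shape
    fused = np.zeros(size)
    for response, weight in zip(responses, layer_weights):
        fused += weight * resize_nearest(response, size)
    return fused


def locate(templates, searches, attentions, layer_weights):
    """Weight each layer's template by its attention map, match, fuse, take the peak."""
    responses = [
        cross_correlate(attention * template, search)  # (h, w) broadcasts over C
        for template, search, attention in zip(templates, searches, attentions)
    ]
    fused = fuse_responses(responses, layer_weights)
    row, col = np.unravel_index(np.argmax(fused), fused.shape)
    return int(row), int(col)


# Toy usage: a low-level (fine) and a high-level (coarse) feature pair.
rng = np.random.default_rng(0)
low_template = rng.standard_normal((8, 4, 4))
high_template = rng.standard_normal((8, 2, 2))
low_search = 0.1 * rng.standard_normal((8, 12, 12))
high_search = 0.1 * rng.standard_normal((8, 6, 6))
low_search[:, 3:7, 4:8] += low_template    # target placed at row 3, col 4
high_search[:, 1:3, 2:4] += high_template  # same target at half resolution

# Hypothetical attention maps in [0.5, 1]; stand-ins for A-Net output.
attentions = [rng.uniform(0.5, 1.0, (4, 4)), rng.uniform(0.5, 1.0, (2, 2))]
print(locate([low_template, high_template], [low_search, high_search],
             attentions, layer_weights=[0.5, 0.5]))
```

Fusing at the finest resolution lets the lower layer decide the exact peak, while the coarser, more semantic layer only needs to agree on the neighbourhood. A learned or adaptive fusion rule could replace the fixed `layer_weights`.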
