UvA-DARE (Digital Academic Repository) Siamese Instance Search for Tracking

In this paper we present a tracker, which is radically different from state-of-the-art trackers: we apply no model updating, no occlusion detection, no combination of trackers, no geometric matching, and still deliver state-of-the-art tracking performance, as demonstrated on the popular online tracking benchmark (OTB) and six very challenging YouTube videos. The presented tracker simply matches the initial patch of the target in the first frame with candidates in a new frame and returns the most similar patch by a learned matching function. The strength of the matching function comes from being extensively trained generically, i.e., without any data of the target, using a Siamese deep neural network, which we design for tracking. Once learned, the matching function is used as is, without any adapting, to track previously unseen targets. It turns out that the learned matching function is so powerful that a simple tracker built upon it, coined Siamese INstance search Tracker, SINT, which only uses the original observation of the target from the first frame, suffices to reach state-of-the-art performance. Further, we show the proposed tracker even allows for target re-identification after the target was absent for a complete video shot.

[1]  Alberto Del Bimbo,et al.  Object Tracking by Oversampling Local Features , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Trevor Darrell,et al.  Fully Convolutional Networks for Semantic Segmentation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3]  Yuning Jiang,et al.  Randomized visual phrases for object search , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[4]  Shih-Fu Chang,et al.  Attributes and categories for generic instance search from one example , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Jitendra Malik,et al.  Hypercolumns for object segmentation and fine-grained localization , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Simone Calderara,et al.  Visual Tracking: An Experimental Survey , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[7]  Richard O. Duda,et al.  Pattern classification and scene analysis , 1974, A Wiley-Interscience publication.

[8]  Geoffrey E. Hinton,et al.  Rectified Linear Units Improve Restricted Boltzmann Machines , 2010, ICML.

[9]  Arnold W. M. Smeulders,et al.  Locality in Generic Instance Search from One Example , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[10]  Yann LeCun,et al.  Signature Verification Using A "Siamese" Time Delay Neural Network , 1993, Int. J. Pattern Recognit. Artif. Intell..

[11]  Zhe Chen,et al.  MUlti-Store Tracker (MUSTer): A cognitive psychology inspired approach to object tracking , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Yang Lu,et al.  Online Object Tracking, Learning and Parsing with And-Or Graphs , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[13]  Dit-Yan Yeung,et al.  Learning a Deep Compact Image Representation for Visual Tracking , 2013, NIPS.

[14]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[15]  Serge J. Belongie,et al.  Learning deep representations for ground-to-aerial geolocalization , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Haibin Ling,et al.  Robust visual tracking using ℓ1 minimization , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[17]  Yi Wu,et al.  Online Object Tracking: A Benchmark , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[18]  Rahul Sukthankar,et al.  MatchNet: Unifying feature and metric learning for patch-based matching , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Huchuan Lu,et al.  Superpixel tracking , 2011, 2011 International Conference on Computer Vision.

[20]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[21]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Uwe D. Hanebeck,et al.  Template matching using fast normalized cross correlation , 2001, SPIE Defense + Commercial Sensing.

[23]  Abhinav Gupta,et al.  Transferring Rich Feature Hierarchies for Robust Visual Tracking , 2015, ArXiv.

[24]  Lei Luo,et al.  Enable Scale and Aspect Ratio Adaptability in Visual Tracking with Detection Proposals , 2015, BMVC.

[25]  Huchuan Lu,et al.  Robust object tracking via sparsity-based collaborative model , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[26]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[27]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[28]  Jin Gao,et al.  Transfer Learning Based Visual Tracking with Gaussian Processes Regression , 2014, ECCV.

[29]  Vibhav Vineet,et al.  Struck: Structured Output Tracking with Kernels , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[31]  Ehud Rivlin,et al.  Robust Fragments-based Tracking using the Integral Histogram , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[32]  Ming Yang,et al.  DeepFace: Closing the Gap to Human-Level Performance in Face Verification , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[33]  C. Lawrence Zitnick,et al.  Edge Boxes: Locating Object Proposals from Edges , 2014, ECCV.

[34]  Luc Van Gool,et al.  Query Adaptive Similarity for Large Scale Object Retrieval , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[35]  Koen E. A. van de Sande,et al.  Codemaps - Segment, Classify and Search Objects Locally , 2013, 2013 IEEE International Conference on Computer Vision.

[36]  Yannis Avrithis,et al.  Image Search with Selective Match Kernels: Aggregation Across Single and Multiple Images , 2016, International Journal of Computer Vision.

[37]  Yann LeCun,et al.  Computing the stereo matching cost with a convolutional neural network , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Takeo Kanade,et al.  An Iterative Image Registration Technique with an Application to Stereo Vision , 1981, IJCAI.

[39]  Nassir Navab,et al.  Efficient Learning of Linear Predictors for Template Tracking , 2014, International Journal of Computer Vision.

[40]  Zdenek Kalal,et al.  Tracking-Learning-Detection , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[41]  Bo Hu,et al.  Robust Occlusion Handling in Object Tracking , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[42]  Horst Bischof,et al.  Hough-based tracking of non-rigid objects , 2011, 2011 International Conference on Computer Vision.

[43]  Arnold W. M. Smeulders,et al.  Robust Tracking Using Foreground-Background Texture Discrimination , 2006, International Journal of Computer Vision.

[44]  Yannis Avrithis,et al.  To Aggregate or Not to aggregate: Selective Match Kernels for Image Search , 2013, 2013 IEEE International Conference on Computer Vision.

[45]  Michael Isard,et al.  Object retrieval with large vocabularies and fast spatial matching , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[46]  Jitendra Malik,et al.  Large Displacement Optical Flow: Descriptor Matching in Variational Motion Estimation , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[47]  Yann LeCun,et al.  Learning a similarity metric discriminatively, with application to face verification , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[48]  Seunghoon Hong,et al.  Online Tracking by Learning Discriminative Saliency Map with Convolutional Neural Network , 2015, ICML.

[49]  Stan Sclaroff,et al.  MEEM: Robust Tracking via Multiple Experts Using Entropy Minimization , 2014, ECCV.

[50]  Ming-Hsuan Yang,et al.  Incremental Learning for Robust Visual Tracking , 2008, International Journal of Computer Vision.

[51]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[52]  Dit-Yan Yeung,et al.  Understanding and Diagnosing Visual Tracking Systems , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[53]  Cordelia Schmid,et al.  Hamming Embedding and Weak Geometric Consistency for Large Scale Image Search , 2008, ECCV.

[54]  Dorin Comaniciu,et al.  Real-time tracking of non-rigid objects using mean shift , 2000, Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662).

[55]  Yi Li,et al.  DeepTrack: Learning Discriminative Feature Representations by Convolutional Neural Networks for Visual Tracking , 2014, BMVC.

[56]  Thomas Mensink,et al.  Improving the Fisher Kernel for Large-Scale Image Classification , 2010, ECCV.

[57]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[58]  Nikos Komodakis,et al.  Learning to compare image patches via convolutional neural networks , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).