DeepTrack: Learning Discriminative Feature Representations by Convolutional Neural Networks for Visual Tracking

Defining hand-crafted feature representations requires expert knowledge and time-consuming manual adjustment, and it is arguably one of the limiting factors of object tracking. In this paper, we propose a novel solution that automatically relearns the most useful feature representations during the tracking process, in order to adapt accurately to appearance changes and to pose and scale variations while preventing drift and tracking failures. We employ a candidate pool of multiple Convolutional Neural Networks (CNNs) as a data-driven model of different instances of the target object. Individually, each CNN maintains a specific set of kernels that favourably discriminate object patches from their surrounding background using all available low-level cues. These kernels are updated in an online manner at each frame after being trained with just one instance at the initialization of the corresponding CNN. Given a frame, the most promising CNNs in the pool are selected to evaluate the hypotheses for the target object. The hypothesis with the highest score is assigned as the current detection window, and the selected models are retrained using a warm-start back-propagation which optimizes a structural loss function. In addition to the model-free tracker, we introduce a class-specific version of the proposed method that is tailored for tracking a particular object class, such as human faces. Our experiments on a large selection of videos from recent benchmarks demonstrate that our method outperforms the existing state-of-the-art algorithms and rarely loses track of the target object.
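The per-frame selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation. Each CNN is replaced by a hypothetical callable that scores candidate feature vectors. The names `select_models`, `track_frame`, and `model_quality`, and the averaging of the selected models' scores, are assumptions made for illustration. The abstract does not specify how models are ranked or how their scores are combined, and the warm-start retraining step is omitted.

```python
import numpy as np

def select_models(model_quality, k):
    """Return the indices of the k models with the highest quality value.

    `model_quality` is a hypothetical per-model ranking signal; the abstract
    does not say how "most promising" is measured.
    """
    order = np.argsort(model_quality)[::-1]  # descending order of quality
    return order[:k]

def track_frame(models, model_quality, candidates, k=2):
    """Score all candidate windows with the selected models and pick the best.

    models        : list of callables mapping an (n, d) array to (n,) scores
                    (stand-ins for the CNNs in the pool)
    model_quality : (m,) array, one value per model
    candidates    : (n, d) array, one feature vector per hypothesis
    """
    selected = select_models(np.asarray(model_quality), k)
    # Assumption: combine the selected models by averaging their scores.
    scores = np.mean([models[i](candidates) for i in selected], axis=0)
    best = int(np.argmax(scores))  # hypothesis with the highest score
    return best, selected

# Toy pool of three linear scorers standing in for CNNs.
weights = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, -1.0])]
models = [lambda x, w=w: x @ w for w in weights]
quality = [0.9, 0.8, 0.1]
candidates = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

best, selected = track_frame(models, quality, candidates, k=2)
```

In this toy setup the two highest-quality models are selected, and the candidate that both of them favour becomes the detection window. In the method described above, the selected models would then be retrained on the new detection.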
