Compact Deep Invariant Descriptors for Video Retrieval

With the emerging demand for large-scale video analysis, the Moving Picture Experts Group (MPEG) initiated the Compact Descriptors for Video Analysis (CDVA) standardization in 2014. In this work, we develop novel deep-learning features and incorporate them into the well-established CDVA evaluation framework to study their effectiveness in video analysis. In particular, we propose a Nested Invariance Pooling (NIP) method to obtain compact and robust Convolutional Neural Network (CNN) descriptors. The CNN descriptors are generated by applying three different pooling operations to the CNN feature maps in a nested manner, yielding rotation- and scale-invariant feature representations. Furthermore, we present the rationale, advantages, and performance of combining CNN and handcrafted descriptors, to better investigate the complementary effects of deep-learned and handcrafted features. Extensive experimental results show that the proposed CNN descriptors outperform both state-of-the-art CNN descriptors and the canonical handcrafted descriptors adopted in the CDVA Experimental Model (CXM), with significant mAP gains of 11.3% and 4.7%, respectively. Moreover, the combination of NIP-derived deep invariant descriptors and handcrafted descriptors not only fits within the lowest bitrate budget of CDVA, but also significantly advances the performance of CDVA core techniques.
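
The nested pooling idea can be illustrated with a short sketch. The snippet below is a minimal, assumption-laden illustration rather than the standardized CDVA configuration: the function name `nested_invariance_pooling`, the sampling of rotations and scales, and the particular choice and order of pooling moments (average over spatial positions, standard deviation over scales, max over rotations) are hypothetical choices made here for clarity.

```python
import numpy as np

def nested_invariance_pooling(feature_maps):
    """
    Illustrative sketch of Nested Invariance Pooling (NIP) over CNN feature maps.

    feature_maps: array of shape (R, S, C, H, W) holding the last convolutional
    feature maps extracted from R rotated and S scaled versions of a video
    keyframe, with C channels of spatial size H x W. The transformation
    sampling and the moments below are illustrative assumptions.
    """
    # Innermost pooling: average over spatial positions (translation group),
    # giving one C-dimensional vector per (rotation, scale) pair.
    spatial_pooled = feature_maps.mean(axis=(3, 4))   # (R, S, C)

    # Intermediate pooling: standard deviation over the sampled scales,
    # summarizing how activations vary across the scale group.
    scale_pooled = spatial_pooled.std(axis=1)         # (R, C)

    # Outermost pooling: max over the sampled rotations, yielding a single
    # rotation-invariant descriptor.
    descriptor = scale_pooled.max(axis=0)             # (C,)

    # L2-normalize so descriptors can be compared with cosine/Euclidean matching.
    return descriptor / (np.linalg.norm(descriptor) + 1e-12)

# Toy usage: 8 rotations, 4 scales, 512-channel feature maps of size 7x7.
maps = np.random.rand(8, 4, 512, 7, 7).astype(np.float32)
nip = nested_invariance_pooling(maps)
print(nip.shape)  # (512,)
```

In practice, such a pooled descriptor would typically be further reduced and quantized so that the final representation stays within the CDVA bitrate budgets.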
