HNIP: Compact Deep Invariant Representations for Video Matching, Localization, and Retrieval

With the emerging demand for large-scale video analysis, MPEG initiated the compact descriptors for video analysis (CDVA) standardization in 2014. Going beyond the handcrafted descriptors adopted by the current MPEG-CDVA reference model, we study deep learned global descriptors for video matching, localization, and retrieval. First, inspired by a recent invariance theory, we propose a nested invariance pooling (NIP) method that derives compact deep global descriptors from convolutional neural networks (CNNs) by progressively encoding translation, scale, and rotation invariances into the pooled descriptors. Second, our empirical studies show that a well-designed sequence of pooling moments (e.g., max or average) can drastically affect video matching performance, which motivates us to design hybrid pooling operations via NIP (HNIP). HNIP further improves the discriminability of deep global descriptors. Third, we analyze the technical merits and performance improvements of combining deep and handcrafted descriptors to better understand their complementary effects. We evaluate the effectiveness of HNIP within the well-established MPEG-CDVA evaluation framework. Extensive experiments demonstrate that HNIP outperforms the state-of-the-art deep and canonical handcrafted descriptors with significant mAP gains of 5.5% and 4.7%, respectively. In particular, combining HNIP-incorporated CNN descriptors with handcrafted global descriptors significantly boosts the performance of CDVA core techniques at a comparable descriptor size.
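The nesting idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the input layout `(rotations, scales, positions, channels)`, the helper names `moment_pool` and `nested_pool`, the default moments, and the final L2 normalization are all assumptions made for illustration. Each pooling stage collapses one transformation axis with a generalized mean, so that `p = 1` gives average pooling and `p = inf` gives max pooling, and mixing different moments across stages corresponds to the hybrid variant.

```python
import numpy as np

def moment_pool(x, axis, p):
    """Pool non-negative activations along `axis` with a generalized mean.

    p = 1 is average pooling; p = np.inf is max pooling.
    """
    if np.isinf(p):
        return x.max(axis=axis)
    return np.mean(x ** p, axis=axis) ** (1.0 / p)

def nested_pool(feats, moments=(2.0, 1.0, np.inf)):
    """Collapse translation, then scale, then rotation (hypothetical layout).

    feats   : array of shape (rotations, scales, positions, channels),
              assumed to hold non-negative CNN activations (e.g. after ReLU).
    moments : (p_translation, p_scale, p_rotation); illustrative defaults.
    """
    p_t, p_s, p_r = moments
    x = moment_pool(feats, axis=2, p=p_t)  # over positions -> (R, S, C)
    x = moment_pool(x, axis=1, p=p_s)      # over scales    -> (R, C)
    x = moment_pool(x, axis=0, p=p_r)      # over rotations -> (C,)
    norm = np.linalg.norm(x)
    return x / norm if norm > 0 else x     # L2-normalize (assumed step)

# Synthetic stand-in for activations: 4 rotations, 3 scales, 7x7 positions, 8 channels.
rng = np.random.default_rng(0)
feats = rng.random((4, 3, 49, 8))
descriptor = nested_pool(feats)
print(descriptor.shape)
```

Because every stage is a symmetric function of the axis it pools over, reordering the positions, scales, or rotated copies in `feats` leaves the descriptor unchanged. That is the sense in which the pooled vector is invariant to the sampled transformations in this sketch.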
