Nested Invariance Pooling and RBM Hashing for Image Instance Retrieval

The goal of this work is to compute very compact binary hashes for image instance retrieval. Our approach makes two novel contributions. The first is Nested Invariance Pooling (NIP), a method inspired by i-theory, a mathematical theory for computing group-invariant transformations with feed-forward neural networks. NIP produces compact, well-performing descriptors from visual representations extracted from convolutional neural networks. We specifically incorporate scale, translation, and rotation invariances, but the scheme can be extended to arbitrary sets of transformations. We also show that using moments of increasing order throughout the nesting is important. The second is RBMH: the NIP descriptors are hashed to the target code size (32-256 bits) with a Restricted Boltzmann Machine trained with a novel batch-level regularization scheme designed specifically for hashing. A thorough empirical comparison against state-of-the-art methods shows that both the NIP descriptors and the NIP+RBMH hashes achieve consistently outstanding results across a wide range of datasets.
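
The nesting idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `moment_pool` and `nested_invariance_pool`, the input layout `features[rotation][scale][location]`, the nesting order (translation, then scale, then rotation), and the specific moment orders (1, 2, infinity) are all assumptions chosen to show "moments of increasing order throughout the nesting". Activations are assumed non-negative, as after a ReLU.

```python
import math


def moment_pool(vectors, order):
    """Pool equal-length vectors channel-wise with a generalized mean.

    order=1 is average pooling, order=2 is root-mean-square pooling,
    and order=math.inf is max pooling. Assumes non-negative activations.
    """
    count = len(vectors)
    dims = len(vectors[0])
    if math.isinf(order):
        return [max(v[d] for v in vectors) for d in range(dims)]
    return [
        (sum(v[d] ** order for v in vectors) / count) ** (1.0 / order)
        for d in range(dims)
    ]


def nested_invariance_pool(features, orders=(1, 2, math.inf)):
    """Hypothetical nested pooling over three transformation groups.

    features[r][s][t] is the channel vector for rotation r, scale s,
    and spatial location t. Each nesting level pools over one group
    and uses a higher moment order than the level before it.
    """
    per_rotation = []
    for rotation in features:
        # Innermost level: pool over locations (translation).
        per_scale = [moment_pool(locations, orders[0]) for locations in rotation]
        # Middle level: pool the translation-pooled vectors over scales.
        per_rotation.append(moment_pool(per_scale, orders[1]))
    # Outermost level: pool over rotations.
    return moment_pool(per_rotation, orders[2])


# Toy input: 2 rotations x 2 scales x 2 locations x 2 channels.
features = [
    [[[1.0, 0.0], [3.0, 2.0]], [[0.0, 0.0], [0.0, 0.0]]],
    [[[2.0, 2.0], [2.0, 2.0]], [[2.0, 2.0], [2.0, 2.0]]],
]
descriptor = nested_invariance_pool(features)
```

Because each level is a symmetric function of the samples it pools, reordering the rotated, rescaled, or shifted copies leaves `descriptor` unchanged, which is the sense in which the pooled output is invariant to the sampled transformations. Hashing the resulting descriptor to a binary code is a separate step and is not shown here.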
