FhVLAD: Fine-grained quantization and encoding high-order descriptor statistics for scalable image retrieval

We are interested in the encoding of local descriptors of an image (e.g. SIFT) to design a compact representation vector and thereby address scalable image retrieval. We revisit the implicit design choices in the popular vector of locally aggregated descriptors (VLAD), which aggregates the residuals of descriptors to the codewords. VLAD’s use of a coarse codebook and first-order descriptor statistics in residual computation results in less discriminative residuals. To address this problem, we propose a division of codebook feature space using a novel fine-grained quantization strategy. After quantization, we embed the resulting residuals with high-order statistics of descriptor distribution. Experiments on three challenging image retrieval datasets (INRIA Holidays, UKBench, Oxford 5k) confirm the improved discriminative power of our novel encoding method called FhVLAD. We observe superior accuracy to baseline and competitive performance to state-of-the-art techniques with a limited increase in dimension.

[1]  Victor S. Lempitsky,et al.  Neural Codes for Image Retrieval , 2014, ECCV.

[2]  Cordelia Schmid,et al.  Improving Bag-of-Features for Large Scale Image Search , 2010, International Journal of Computer Vision.

[3]  Hongxun Yao,et al.  Exploiting the complementary strengths of multi-layer CNN features for image retrieval , 2017, Neurocomputing.

[4]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[5]  Rama Chellappa,et al.  VLAD encoded Deep Convolutional features for unconstrained face verification , 2016, 2016 23rd International Conference on Pattern Recognition (ICPR).

[6]  Victor S. Lempitsky,et al.  Aggregating Local Deep Features for Image Retrieval , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[7]  Andrew Zisserman,et al.  All About VLAD , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Michael Isard,et al.  Object retrieval with large vocabularies and fast spatial matching , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[9]  Thomas S. Huang,et al.  Image Classification Using Super-Vector Coding of Local Image Descriptors , 2010, ECCV.

[10]  David Nistér,et al.  Scalable Recognition with a Vocabulary Tree , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[11]  Andrew Zisserman,et al.  Triangulation Embedding and Democratic Aggregation for Image Search , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[12]  Qi Tian,et al.  Fine-residual VLAD for image retrieval , 2016, Neurocomputing.

[13]  Hongtao Lu,et al.  Image classification based on improved VLAD , 2015, Multimedia Tools and Applications.

[14]  Florent Perronnin,et al.  Fisher Kernels on Visual Vocabularies for Image Categorization , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Xiu-Shen Wei,et al.  Selective Convolutional Descriptor Aggregation for Fine-Grained Image Retrieval , 2016, IEEE Transactions on Image Processing.

[16]  Junqing Yu,et al.  A multi-level descriptor using ultra-deep feature for image retrieval , 2019, Multimedia Tools and Applications.

[17]  Qi Tian,et al.  Making Residual Vector Distribution Uniform for Distinctive Image Representation , 2016, IEEE Transactions on Circuits and Systems for Video Technology.

[18]  Hervé Jégou,et al.  Negative Evidences and Co-occurences in Image Retrieval: The Benefit of PCA and Whitening , 2012, ECCV.

[19]  Cordelia Schmid,et al.  Aggregating Local Image Descriptors into Compact Codes , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20]  Cordelia Schmid,et al.  Local Convolutional Features with Unsupervised Training for Image Retrieval , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[21]  Michael Isard,et al.  Lost in quantization: Improving particular object retrieval in large scale image databases , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Ying Wu,et al.  Spatially-Constrained Similarity Measurefor Large-Scale Object Retrieval , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[23]  Patrick Pérez,et al.  Revisiting the VLAD image representation , 2013, ACM Multimedia.

[24]  Nicu Sebe,et al.  A modified vector of locally aggregated descriptors approach for fast video classification , 2016, Multimedia Tools and Applications.

[25]  Cordelia Schmid,et al.  Convolutional Kernel Networks , 2014, NIPS.

[26]  Atsuto Maki,et al.  Visual Instance Retrieval with Deep Convolutional Networks , 2014, ICLR.

[27]  Anastasios Tefas,et al.  Deep convolutional learning for Content Based Image Retrieval , 2018, Neurocomputing.

[28]  Shyamanta M. Hazarika,et al.  Encoding High-Order Statistics in VLAD for Scalable Image Retrieval , 2019, PReMI.

[29]  Ronan Sicre,et al.  Particular object retrieval with integral max-pooling of CNN activations , 2015, ICLR.

[30]  Ling-Yu Duan,et al.  Hierarchical multi-VLAD for image retrieval , 2015, 2015 IEEE International Conference on Image Processing (ICIP).

[31]  Andrew Zisserman,et al.  Three things everyone should know to improve object retrieval , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[32]  Cordelia Schmid,et al.  Scale & Affine Invariant Interest Point Detectors , 2004, International Journal of Computer Vision.

[33]  Cheng Wang,et al.  Distribution Entropy Boosted VLAD for Image Retrieval , 2016, Entropy.

[34]  Qing Li,et al.  Multiple VLAD Encoding of CNNs for Image Classification , 2017, Computing in Science & Engineering.

[35]  Stefan Carlsson,et al.  CNN Features Off-the-Shelf: An Astounding Baseline for Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[36]  Bohyung Han,et al.  Large-Scale Image Retrieval with Attentive Deep Local Features , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[37]  Ke Zhang,et al.  A hierarchical recurrent approach to predict scene graphs from a visual‐attention‐oriented perspective , 2019, Comput. Intell..

[38]  David G. Lowe,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004, International Journal of Computer Vision.

[39]  Dongming Zhang,et al.  Spatial pyramid VLAD , 2014, 2014 IEEE Visual Communications and Image Processing Conference.

[40]  Limin Wang,et al.  Boosting VLAD with Supervised Dictionary Learning and High-Order Statistics , 2014, ECCV.

[41]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[42]  Larry S. Davis,et al.  Exploiting local features from deep networks for image retrieval , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[43]  Thomas Mensink,et al.  Improving the Fisher Kernel for Large-Scale Image Classification , 2010, ECCV.

[44]  Christopher M. Bishop,et al.  Pattern Recognition and Machine Learning (Information Science and Statistics) , 2006 .

[45]  Torsten Sattler,et al.  Large-Scale Location Recognition and the Geometric Burstiness Problem , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Lei Wang,et al.  In defense of soft-assignment coding , 2011, 2011 International Conference on Computer Vision.

[47]  Guillaume Gravier,et al.  Oriented pooling for dense and non-dense rotation-invariant features , 2013, BMVC.

[48]  Miroslaw Bober,et al.  Improving Large-Scale Image Retrieval Through Robust Aggregation of Local Descriptors , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[49]  Luc Van Gool,et al.  SURF: Speeded Up Robust Features , 2006, ECCV.

[50]  Gabriela Csurka,et al.  Visual categorization with bags of keypoints , 2002, eccv 2004.

[51]  Christian Eggert,et al.  Improving VLAD: Hierarchical coding and a refined local coordinate system , 2014, 2014 IEEE International Conference on Image Processing (ICIP).

[52]  K. Balanda,et al.  Kurtosis: A Critical Review , 1988 .

[53]  Zhuang Miao,et al.  Adding spatial distribution clue to aggregated vector in image retrieval , 2018, EURASIP Journal on Image and Video Processing.

[54]  Cordelia Schmid,et al.  Hamming Embedding and Weak Geometric Consistency for Large Scale Image Search , 2008, ECCV.

[55]  Florent Perronnin,et al.  Large-scale image retrieval with compressed Fisher vectors , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[56]  Cordelia Schmid,et al.  Aggregating local descriptors into a compact image representation , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[57]  Svetlana Lazebnik,et al.  Multi-scale Orderless Pooling of Deep Convolutional Activation Features , 2014, ECCV.

[58]  Josef Sivic,et al.  NetVLAD: CNN Architecture for Weakly Supervised Place Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[59]  Yihong Gong,et al.  Locality-constrained Linear Coding for image classification , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[60]  Yihong Gong,et al.  Linear spatial pyramid matching using sparse coding for image classification , 2009, CVPR.