Making Residual Vector Distribution Uniform for Distinctive Image Representation

Recently, the vector of locally aggregated descriptors (VLAD) has been shown to be highly efficient for image representation. However, because it divides the feature space coarsely, its discriminative power is limited. One intuitive remedy is to construct the VLAD with a larger vocabulary, but this yields a higher-dimensional VLAD and increases the computational cost of learning the principal component analysis (PCA) parameters used to project the VLAD onto a low-dimensional space. In this paper, we propose a hierarchical scheme for building the VLAD. In our approach, a hidden-layer visual vocabulary is constructed by generating several subwords for each visual word of a coarse vocabulary. The hidden-layer vocabulary divides the feature space more finely. We then aggregate the residuals computed in the hidden-layer vocabulary to the coarse layer, obtaining an image descriptor of the same dimension as the original VLAD. In addition, we show that applying a whitening operation to the local descriptors further enhances the discriminative power of the VLAD. We validate our approach with experiments conducted mainly on three benchmark data sets, i.e., the Holidays data set, the UKBench data set, and the Oxford Building data set with Flickr1M as distractors, and compare it with related VLAD-based algorithms. The experimental results demonstrate the effectiveness of our algorithm.
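The hierarchical aggregation and the descriptor whitening described above can be illustrated with a minimal NumPy sketch. It is only one plausible reading of the abstract, not the paper's implementation. The function names (`fit_whitening`, `whiten`, `hierarchical_vlad`), the two-step nearest-neighbour assignment (coarse word first, then its subwords), the final L2 normalization, and the small regularizer `eps` are all assumptions made for illustration.

```python
import numpy as np


def fit_whitening(train, eps=1e-8):
    """Learn a PCA-whitening transform from training descriptors of shape (n, d).

    Assumption: plain PCA whitening; eps guards against near-zero eigenvalues.
    """
    mean = train.mean(axis=0)
    cov = np.cov(train - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Scale each eigenvector (column) by 1 / sqrt(eigenvalue).
    W = eigvecs / np.sqrt(np.maximum(eigvals, 0.0) + eps)
    return mean, W


def whiten(x, mean, W):
    """Apply the learned whitening transform to descriptors of shape (n, d)."""
    return (x - mean) @ W


def hierarchical_vlad(descriptors, coarse_words, subwords):
    """Aggregate subword residuals into the slots of the coarse vocabulary.

    descriptors:  (n, d) local descriptors
    coarse_words: (k, d) coarse visual words
    subwords:     (k, m, d) hypothetical hidden-layer subwords, m per coarse word
    Returns a vector of length k * d, the same size as a standard VLAD.
    """
    k, d = coarse_words.shape
    vlad = np.zeros((k, d))
    for x in descriptors:
        # Step 1 (assumed): assign the descriptor to its nearest coarse word.
        i = np.argmin(np.linalg.norm(coarse_words - x, axis=1))
        # Step 2 (assumed): pick the nearest subword of that coarse word.
        j = np.argmin(np.linalg.norm(subwords[i] - x, axis=1))
        # Residual w.r.t. the subword, accumulated in the coarse word's slot.
        vlad[i] += x - subwords[i, j]
    v = vlad.ravel()
    norm = np.linalg.norm(v)
    # Assumed L2 normalization; an all-zero vector is returned unchanged.
    return v / norm if norm > 0 else v
```

Because the residuals are summed per coarse word rather than per subword, the output length depends only on the coarse vocabulary size `k` and the descriptor dimension `d`, which matches the claim that the descriptor keeps the dimension of the original VLAD.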
