Improving cross-dimensional weighting pooling with multi-scale feature fusion for image retrieval

Abstract In this paper, we aim to obtain effective image representations for image retrieval in an unsupervised manner. To this end, we propose a fully cross-dimensional weighting pooling (FCroW) method that improves the weighting strategy of cross-dimensional weighting pooling (CroW). More specifically, FCroW weights both the non-zero parts and the zero parts of convolutional layers, aiming to obtain robust image representations. In particular, we use the proposed FCroW to aggregate multi-scale features extracted by convolutional neural networks, taking into account multiple aspects of the visual features captured by the networks. Different weights can be assigned to the features extracted by different layers of the networks. To reduce the parameter-tuning effort, we propose an initial strategy that prunes the search space of the weights by means of constraint rules designed from prior knowledge of the relations between the layers of the networks. Based on this, we propose weighted multi-layer feature fusion for similar image representations. Extensive experiments conducted on four public real-world datasets demonstrate the effectiveness of the proposed FCroW method and the pruning strategy for image retrieval.
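The weighted multi-layer fusion and the pruning of the weight search space can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes each layer has already been pooled into one descriptor vector (for example by a CroW-style pooling step, which is not reproduced here), and it uses a hypothetical constraint rule, `non_decreasing`, that never gives a deeper layer a smaller weight than a shallower one. The abstract does not state the actual constraint rules, the weight grid, or the fusion operator, so the function names, the grid values, and the choice of concatenation are all assumptions made for illustration.

```python
import itertools
import numpy as np


def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    return v / (np.linalg.norm(v) + eps)


def fuse(layer_descriptors, weights):
    """Fuse per-layer descriptors: normalise each, scale by its weight,
    concatenate, and normalise the result. Concatenation is an assumption."""
    parts = [w * l2_normalize(d) for d, w in zip(layer_descriptors, weights)]
    return l2_normalize(np.concatenate(parts))


def candidate_weights(n_layers, grid, rule=None):
    """Enumerate weight tuples on a grid, keeping only those accepted by `rule`."""
    for ws in itertools.product(grid, repeat=n_layers):
        if rule is None or rule(ws):
            yield ws


def non_decreasing(ws):
    """Hypothetical constraint: a deeper layer never gets a smaller weight."""
    return all(a <= b for a, b in zip(ws, ws[1:]))


# Illustrative grid and layer count (assumed, not from the paper).
grid = (0.25, 0.5, 0.75, 1.0)
full = list(candidate_weights(3, grid))
pruned = list(candidate_weights(3, grid, rule=non_decreasing))

# Fuse two toy descriptors of different dimensionality.
fused = fuse([np.array([3.0, 4.0]), np.array([1.0, 0.0, 0.0])], (0.5, 1.0))
```

With this grid and three layers, the unconstrained search has 4^3 = 64 weight tuples, while the hypothetical ordering rule leaves 20. Each surviving tuple would still have to be scored on a retrieval benchmark to pick the final weights.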
