Towards Optimal CNN Descriptors for Large-Scale Image Retrieval

Instance-level image retrieval is a long-standing and challenging problem in multimedia. Recently, fine-tuning Convolutional Neural Networks (CNNs) has become a promising direction, and a number of successful strategies based on global CNN descriptors have been proposed. However, differences in experimental settings and datasets make it difficult to compare these methods directly and draw conclusions. The goal of this paper is two-fold. First, we present a unified implementation of modern global-CNN-based retrieval systems, break such a system into six major components, and investigate each component both individually and jointly under different configurations. Through a systematic, component-by-component series of experiments, we arrive at an optimal design for such a system. Second, we introduce a novel joint loss function with a learnable parameter for fine-tuning on retrieval tasks and show, through extensive experiments, significant improvement over previous work. On the new and challenging large-scale Google-Landmarks-Dataset, we set a baseline for future research and comparisons. On traditional retrieval benchmarks such as Oxford5k and Paris6k, as well as their recently revised versions ROxford5k and RParis6k, we achieve state-of-the-art performance under all three evaluation protocols (Easy, Medium, and Hard), outperforming competing methods by a large margin.
