Saliency Inside: Learning Attentive CNNs for Content-Based Image Retrieval

In content-based image retrieval (CBIR), one of the most challenging and ambiguous tasks is to correctly understand the intention behind a human query and to measure its semantic relevance to the images in the database. Because visual saliency predicts human visual attention well, and attention is closely related to query intention, this paper investigates the effect of visual saliency on CBIR through qualitative and quantitative experiments. To this end, we first generate fixation density maps for the images of a widely used CBIR dataset using an eye-tracking apparatus. These ground-truth saliency maps are then used to measure the influence of visual saliency on the CBIR task by exploring several plausible ways of incorporating saliency cues into the retrieval process. We find that visual saliency is indeed beneficial to CBIR, and that the best scheme for incorporating saliency may differ across image retrieval models. Inspired by these findings, this paper presents two-stream attentive convolutional neural networks (CNNs) with saliency embedded inside for CBIR. The proposed network has two streams that handle two tasks simultaneously. The main stream focuses on extracting discriminative visual features that are tightly related to semantic attributes. Meanwhile, the auxiliary stream aims to facilitate the main stream by redirecting feature extraction to the salient image content that a human may pay attention to. By fusing these two streams into the Main and Auxiliary CNNs (MAC), image similarity can be computed in a way that is closer to human judgment, preserving conspicuous content and suppressing irrelevant regions. Extensive experiments on four public datasets show that the proposed model achieves strong image retrieval performance.
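To make the idea of incorporating saliency cues into retrieval concrete, the following is a minimal sketch of one plausible scheme: weighting convolutional activations by a saliency map before pooling them into a global descriptor. It is an illustration under stated assumptions, not the MAC network or any of the specific schemes evaluated in the paper. The function names, tensor shapes, and toy data are all hypothetical.

```python
import numpy as np

def saliency_weighted_descriptor(features, saliency, eps=1e-8):
    """Pool a (C, H, W) feature map into a unit-length (C,) descriptor.

    `saliency` is a non-negative (H, W) map (e.g. a fixation density map
    resized to the feature resolution). Both inputs are assumptions of
    this sketch, not the paper's actual interface.
    """
    weights = saliency / (saliency.sum() + eps)  # normalise to sum ~1
    # Weighted sum over the spatial axes: result has shape (C,)
    desc = np.tensordot(features, weights, axes=([1, 2], [0, 1]))
    return desc / (np.linalg.norm(desc) + eps)   # L2-normalise

def cosine_similarity(a, b):
    """Cosine similarity of two already L2-normalised descriptors."""
    return float(np.dot(a, b))

# Toy data: two images share the same central "object" (channel 0)
# but have different background clutter (channels 1 and 2).
center = np.zeros((4, 4), dtype=bool)
center[1:3, 1:3] = True
border = ~center

feat_query = np.zeros((3, 4, 4))
feat_query[0][center] = 1.0   # object
feat_query[1][border] = 1.0   # background type A

feat_db = np.zeros((3, 4, 4))
feat_db[0][center] = 1.0      # same object
feat_db[2][border] = 1.0      # background type B

uniform_map = np.ones((4, 4))        # no saliency: plain average pooling
salient_map = center.astype(float)   # attention on the central object

sim_uniform = cosine_similarity(
    saliency_weighted_descriptor(feat_query, uniform_map),
    saliency_weighted_descriptor(feat_db, uniform_map),
)
sim_salient = cosine_similarity(
    saliency_weighted_descriptor(feat_query, salient_map),
    saliency_weighted_descriptor(feat_db, salient_map),
)
print(f"uniform pooling: {sim_uniform:.3f}, saliency-weighted: {sim_salient:.3f}")
```

In this toy setup the saliency-weighted similarity is higher than the uniformly pooled one, because the differing backgrounds no longer contribute to the descriptors. Whether a given weighting scheme helps in practice depends on the retrieval model, as the findings above suggest.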
