Content-based Image Retrieval and the Semantic Gap in the Deep Learning Era

Content-based image retrieval has seen astonishing progress over the past decade, especially for the task of retrieving images of the same object that is depicted in the query image. This scenario is called instance or object retrieval and requires matching fine-grained visual patterns between images. Semantics, however, do not play a crucial role in this scenario. This raises the question: Do the recent advances in instance retrieval transfer to more generic image retrieval scenarios? To answer this question, we first provide a brief overview of the most relevant milestones of instance retrieval. We then apply them to a semantic image retrieval task and find that they perform worse than much less sophisticated and more generic methods in a setting that requires image understanding. Following this, we review existing approaches to closing this so-called semantic gap by integrating prior world knowledge. We conclude that the key problem for the further advancement of semantic image retrieval lies in the lack of a standardized task definition and an appropriate benchmark dataset.
