Telling the What while Pointing to the Where: Multimodal Queries for Image Retrieval

Most existing image retrieval systems use text queries as the means for users to express what they are looking for. However, fine-grained image retrieval often also requires expressing where in the image the desired content is located. Text can express such localization preferences only cumbersomely, whereas pointing is a much more natural fit. In this paper, we propose an image retrieval setup with a new form of multimodal query, in which the user simultaneously uses both spoken natural language (the what) and mouse traces over an empty canvas (the where) to express the characteristics of the desired target image. We then describe simple modifications to an existing image retrieval model that enable it to operate in this setup. Qualitative and quantitative experiments show that our model effectively takes this spatial guidance into account, and provides significantly more accurate retrieval results than equivalent text-only systems.
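The abstract does not detail how the what and the where are combined; as a minimal illustrative sketch (not the paper's actual model), one could rasterize the mouse trace into a spatial heatmap and fuse a semantic text-image similarity with a spatial-overlap term. The function names, the grid resolution, and the weighting scheme below are all hypothetical choices for illustration only.

```python
import numpy as np

def trace_to_heatmap(trace, grid=8):
    """Rasterize a mouse trace (list of (x, y) points in [0, 1]^2)
    onto a grid x grid canvas, normalized to sum to 1."""
    heat = np.zeros((grid, grid))
    for x, y in trace:
        i = min(int(y * grid), grid - 1)  # row index from y
        j = min(int(x * grid), grid - 1)  # column index from x
        heat[i, j] += 1.0
    return heat / max(heat.sum(), 1e-8)

def score(text_emb, trace_heat, image_emb, image_region_heat, alpha=0.5):
    """Blend semantic similarity (the what) with spatial overlap
    of the trace against the image's content layout (the where)."""
    # Cosine similarity between text and image embeddings.
    sem = np.dot(text_emb, image_emb) / (
        np.linalg.norm(text_emb) * np.linalg.norm(image_emb))
    # Histogram intersection between the two spatial distributions.
    spa = np.minimum(trace_heat, image_region_heat).sum()
    return alpha * sem + (1 - alpha) * spa
```

Under this sketch, an image whose relevant content lies where the user traced scores higher than one whose content matches only semantically, mirroring the behavior the abstract describes.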
