Drill-down: Interactive Retrieval of Complex Scenes using Natural Language Queries

This paper explores the task of interactive image retrieval using natural language queries, where a user progressively provides input queries to refine a set of retrieval results. We study this problem in the context of complex image scenes containing multiple objects. We propose Drill-down, an effective framework that encodes multiple queries into an efficient, compact state representation and significantly extends current methods for single-round image retrieval. We show that using multiple rounds of natural language queries as input can be surprisingly effective for finding arbitrarily specific images of complex scenes. Furthermore, we find that existing image datasets with textual captions can provide an effective form of weak supervision for this task. We compare our method with existing sequential encoding and embedding networks, demonstrating superior performance on two proposed benchmarks: automatic image retrieval in a simulated scenario that uses region captions as queries, and interactive image retrieval using real queries from human evaluators.
