OVIS: Open-Vocabulary Visual Instance Search via Visual-Semantic Aligned Representation Learning

We introduce the task of open-vocabulary visual instance search (OVIS). Given an arbitrary textual search query, OVIS aims to return, from an image database, a ranked list of visual instances (i.e., image patches) that satisfy the search intent. The term “open vocabulary” means that there are no restrictions on the visual instances to be searched and no restrictions on the words that can be used to compose the textual search query. We propose to address this search challenge via visual-semantic aligned representation learning (ViSA). ViSA leverages massive image-caption pairs as weak image-level (not instance-level) supervision to learn a rich cross-modal semantic space in which the representations of visual instances (not images) and those of textual queries are aligned, allowing us to measure the similarity between any visual instance and an arbitrary textual query. To evaluate the performance of ViSA, we build two datasets, OVIS40 and OVIS1600, and introduce a pipeline for error analysis. Through extensive experiments on the two datasets, we demonstrate ViSA’s ability to search for visual instances in images not seen during training, given a wide range of textual queries, including those composed of uncommon words. Experimental results show that ViSA achieves an mAP@50 of 21.9% on OVIS40 under the most challenging setting and an mAP@6 of 14.9% on OVIS1600.
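
As a rough illustration of the search step described in the abstract, the sketch below ranks instance embeddings against a query embedding in a shared space and scores the ranked list with average precision at k. It is not the authors' implementation: the use of cosine similarity, the function names (`cosine`, `rank_instances`, `average_precision_at_k`), the toy two-dimensional vectors, and the AP@k normalization by min(number of relevant items, k) are all assumptions made for the example. The abstract does not specify the similarity function or the exact mAP@k definition.

```python
import math
from typing import List, Sequence, Set, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_instances(
    query_vec: Sequence[float],
    instance_vecs: Sequence[Sequence[float]],
    top_k: int,
) -> List[Tuple[int, float]]:
    """Return the top_k (instance index, score) pairs, best match first.

    Assumes query and instance embeddings already live in one aligned space.
    """
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(instance_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def average_precision_at_k(ranked_ids: Sequence[int], relevant: Set[int], k: int) -> float:
    """AP@k under one common convention (assumed here, not taken from the paper):
    sum precision at each relevant hit, divide by min(len(relevant), k)."""
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(ranked_ids[:k], start=1):
        if idx in relevant:
            hits += 1
            precision_sum += hits / rank
    denom = min(len(relevant), k)
    return precision_sum / denom if denom else 0.0


# Toy data: one query embedding and four hypothetical instance embeddings.
query = [1.0, 0.0]
instances = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]

ranked = rank_instances(query, instances, top_k=3)
ranked_ids = [idx for idx, _ in ranked]
ap = average_precision_at_k(ranked_ids, relevant={0, 1}, k=3)
```

Averaging `average_precision_at_k` over a set of queries would give an mAP@k figure in the spirit of the reported metrics, subject to the normalization assumption stated above.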
