Open-Vocabulary Object Detection Using Captions

Despite the remarkable accuracy of deep neural networks in object detection, they are costly to train and scale due to supervision requirements. In particular, learning more object categories typically requires proportionally more bounding box annotations. Weakly supervised and zero-shot learning techniques have been explored to scale object detectors to more categories with less supervision, but they have not been as successful or widely adopted as supervised models. In this paper, we put forth a novel formulation of the object detection problem, namely open-vocabulary object detection, which is more general, more practical, and more effective than weakly supervised and zero-shot approaches. We propose a new method to train object detectors using bounding box annotations for a limited set of object categories, together with image-caption pairs that cover a larger variety of objects at a significantly lower cost. We show that the proposed method can detect and localize objects for which no bounding box annotation is provided during training, at significantly higher accuracy than zero-shot approaches. Meanwhile, objects with bounding box annotations can be detected almost as accurately as with supervised methods, significantly outperforming weakly supervised baselines. Accordingly, we establish a new state of the art for scalable object detection.
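To make the open-vocabulary setup concrete, below is a minimal, hypothetical sketch (not the paper's actual architecture) of a classification head that scores candidate regions against word embeddings of class names. The class `OpenVocabHead`, the dimensions, and the random stand-ins for RoI features and GloVe-style class vectors are all illustrative assumptions; the key idea it demonstrates is that new categories can be recognized at test time by adding their name embeddings, without any new box annotations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OpenVocabHead(nn.Module):
    """Hypothetical open-vocabulary classification head (illustrative only).

    Region features (e.g., from a two-stage detector backbone) are projected
    into a shared visual-semantic space and scored against fixed word
    embeddings of class names (e.g., GloVe or BERT vectors).
    """

    def __init__(self, region_dim: int, embed_dim: int):
        super().__init__()
        # Learned projection from region features to the text-embedding space.
        self.proj = nn.Linear(region_dim, embed_dim)

    def forward(self, region_feats: torch.Tensor, class_embeds: torch.Tensor) -> torch.Tensor:
        # region_feats: (num_regions, region_dim)
        # class_embeds: (num_classes, embed_dim), one row per class name
        v = F.normalize(self.proj(region_feats), dim=-1)
        t = F.normalize(class_embeds, dim=-1)
        # Cosine-similarity logits; softmax over classes gives per-region scores.
        return v @ t.T

# Usage: score 5 candidate regions against 3 class-name embeddings.
head = OpenVocabHead(region_dim=1024, embed_dim=300)
regions = torch.randn(5, 1024)   # stand-in for RoI features
classes = torch.randn(3, 300)    # stand-in for word vectors of class names
scores = head(regions, classes).softmax(dim=-1)
print(scores.shape)  # torch.Size([5, 3])
```

Because the classifier weights are frozen text embeddings rather than learned per-class parameters, the vocabulary is decoupled from the box-annotated training set, which is what distinguishes this setup from a standard supervised detection head.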
