Paper Information - Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision

In this paper, we consider the problem of open-vocabulary semantic segmentation (OVS), which aims to segment objects of arbitrary classes instead of pre-defined, closed-set categories. The main contributions are as follows: First, we propose a transformer-based model for OVS, termed OVSegmentor, which exploits only web-crawled image-text pairs for pre-training, without using any mask annotations. OVSegmentor assembles the image pixels into a set of learnable group tokens via a slot-attention-based binding module, and aligns the group tokens to the corresponding caption embedding. Second, we propose two proxy tasks for training, namely masked entity completion and cross-image mask consistency. The former aims to infer all masked entities in the caption given the group tokens, which enables the model to learn fine-grained alignment between visual groups and text entities. The latter enforces consistent mask predictions between images that contain shared entities, which encourages the model to learn visual invariance. Third, we construct the CC4M dataset for pre-training by filtering CC12M for frequently appearing entities, which significantly improves training efficiency. Fourth, we perform zero-shot transfer on three benchmark datasets: PASCAL VOC 2012, PASCAL Context, and COCO Object. Our model achieves superior segmentation results over the state-of-the-art method while using only 3% of the data (4M vs. 134M) for pre-training. Code and pre-trained models will be released for future research.
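The binding step described in the abstract can be illustrated with a minimal sketch. This is not OVSegmentor's implementation: it assumes a simplified slot-attention-style update with no learned projections, no recurrent update, and no training, and the names `bind_pixels_to_groups`, `hard_masks`, and `best_group_for_caption` are hypothetical. It shows only the core idea that group tokens compete for each pixel through a softmax over groups, and that each resulting group can then be compared to a caption embedding.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b, eps=1e-8):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + eps)


def bind_pixels_to_groups(pixels, groups, iterations=3, eps=1e-8):
    """Simplified slot-attention-style binding (assumption: no learned weights).

    pixels: list of feature vectors, one per pixel or patch.
    groups: list of initial group-token vectors.
    Returns the updated group tokens and the soft assignment matrix
    (one row per pixel, one column per group) used in the last update.
    """
    dim = len(pixels[0])
    scale = 1.0 / math.sqrt(dim)
    assign = []
    for _ in range(iterations):
        # Softmax over the group axis: groups compete for each pixel.
        assign = [softmax([scale * dot(p, g) for g in groups]) for p in pixels]
        updated = []
        for k in range(len(groups)):
            weights = [row[k] for row in assign]
            total = sum(weights) + eps
            # Each group becomes the weighted mean of the pixels it attracts.
            updated.append([
                sum(w * p[d] for w, p in zip(weights, pixels)) / total
                for d in range(dim)
            ])
        groups = updated
    return groups, assign


def hard_masks(assign):
    """Assign each pixel to its highest-scoring group."""
    return [max(range(len(row)), key=row.__getitem__) for row in assign]


def best_group_for_caption(groups, caption_embedding):
    """Index of the group token most similar to a caption embedding."""
    sims = [cosine(g, caption_embedding) for g in groups]
    return max(range(len(sims)), key=sims.__getitem__)


if __name__ == "__main__":
    # Toy 2-D features: two pixels near each axis, two initial group tokens.
    pixels = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    groups, assign = bind_pixels_to_groups(pixels, [[0.6, 0.4], [0.4, 0.6]])
    print(hard_masks(assign))
    print(best_group_for_caption(groups, [1.0, 0.0]))
```

In the paper's setting the group tokens are learnable and the alignment is trained against caption embeddings; the toy vectors here stand in for both.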
