A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model

Recently, zero-shot image classification by visionlanguage pre-training has demonstrated incredible achievements, that the model can classify arbitrary category without seeing additional annotated images of that category. However, it is still unclear how to make the zero-shot recognition working well on broader vision problems, such as object detection and semantic segmentation. In this paper, we target for zero-shot semantic segmentation, by building it on an off-the-shelf pre-trained vision-language model, i.e., CLIP. It is difficult because semantic segmentation and the CLIP model perform on different visual granularity, that semantic segmentation processes on pixels while CLIP performs on images. To remedy the discrepancy on processing granularity, we refuse the use of the prevalent one-stage FCN based framework, and advocate a two-stage semantic segmentation framework, with the first stage extracting generalizable mask proposals and the second stage leveraging an image based CLIP model to perform zero-shot classification on the masked image crops which are generated in the first stage. Our experimental results show that this simple framework surpasses previous state-of-thearts by a large margin: +29.5 hIoU on the Pascal VOC 2012 dataset, and +8.9 hIoU on the COCO Stuff dataset. With its simplicity and strong performance, we hope this framework to serve as a baseline to facilitate the future research. The code are made publicly available at https: //github.com/MendelXu/zsseg.baseline.

[1]  Bumsub Ham,et al.  Exploiting a Joint Embedding Space for Generalized Zero-Shot Semantic Segmentation , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[2]  Stefan Lee,et al.  ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks , 2019, NeurIPS.

[3]  Bolei Zhou,et al.  Scene Parsing through ADE20K Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[5]  Yunchao Wei,et al.  Consistent Structural Relation Learning for Zero-Shot Segmentation , 2020, NeurIPS.

[6]  Shuai Wang,et al.  Cap2Seg: Inferring Semantic and Spatial Context from Captions for Zero-Shot Image Segmentation , 2020, ACM Multimedia.

[7]  Christoph H. Lampert,et al.  Attribute-Based Classification for Zero-Shot Visual Object Categorization , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9]  Furu Wei,et al.  VL-BERT: Pre-training of Generic Visual-Linguistic Representations , 2019, ICLR.

[10]  Fatih Murat Porikli,et al.  Zero-Shot Object Detection: Learning to Simultaneously Recognize and Localize Novel Concepts , 2018, ACCV.

[11]  Ahmed El Kholy,et al.  UNITER: Learning UNiversal Image-TExt Representations , 2019, ECCV 2020.

[12]  Zheng Zhang,et al.  Disentangled Non-Local Neural Networks , 2020, ECCV.

[13]  Byron Boots,et al.  One-Shot Learning for Semantic Segmentation , 2017, BMVC.

[14]  Stephen Lin,et al.  Swin Transformer: Hierarchical Vision Transformer using Shifted Windows , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[15]  Nazli Ikizler-Cinbis,et al.  Zero-Shot Object Detection by Hybrid Region Embedding , 2018, BMVC.

[16]  Charless C. Fowlkes,et al.  Contour Detection and Hierarchical Image Segmentation , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17]  Hiroaki Hayashi,et al.  Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing , 2021, ACM Comput. Surv..

[18]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[19]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[20]  Jianfeng Gao,et al.  Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks , 2020, ECCV.

[21]  Eric P. Xing,et al.  Few-Shot Semantic Segmentation with Prototype Learning , 2018, BMVC.

[22]  C. Lawrence Zitnick,et al.  Edge Boxes: Locating Object Proposals from Edges , 2014, ECCV.

[23]  Frédéric Jurie,et al.  Generating Visual Representations for Zero-Shot Classification , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[24]  Alexander G. Schwing,et al.  Per-Pixel Classification is Not All You Need for Semantic Segmentation , 2021, NeurIPS.

[25]  Bernt Schiele,et al.  Latent Embeddings for Zero-Shot Classification , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Bernt Schiele,et al.  Semantic Projection Network for Zero- and Few-Label Semantic Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Toshihiko Yamasaki,et al.  Zero-Shot Semantic Segmentation via Variational Mapping , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[28]  Ross B. Girshick,et al.  LVIS: A Dataset for Large Vocabulary Instance Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Roberto Cipolla,et al.  SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30]  Cordelia Schmid,et al.  Label-Embedding for Image Classification , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[31]  Vittorio Ferrari,et al.  COCO-Stuff: Thing and Stuff Classes in Context , 2016, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[32]  Jianguo Zhang,et al.  The PASCAL Visual Object Classes Challenge , 2006 .

[33]  Fengmao Lv,et al.  Learning Unbiased Zero-Shot Semantic Segmentation Networks Via Transductive Transfer , 2020, IEEE Signal Processing Letters.

[34]  Xiuye Gu,et al.  Zero-Shot Detection via Vision and Language Knowledge Distillation , 2021, ArXiv.

[35]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[36]  Matthieu Cord,et al.  Zero-Shot Semantic Segmentation , 2019, NeurIPS.

[37]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Trevor Darrell,et al.  Fully Convolutional Networks for Semantic Segmentation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[39]  Siyuan Zhou,et al.  Context-aware Feature Generation For Zero-shot Semantic Segmentation , 2020, ACM Multimedia.

[40]  Iasonas Kokkinos,et al.  DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[41]  Liqing Zhang,et al.  From Pixel to Patch: Synthesize Context-aware Features for Zero-shot Semantic Segmentation , 2020, IEEE transactions on neural networks and learning systems.

[42]  Yongqin Xian,et al.  A Closer Look at Self-training for Zero-Label Semantic Segmentation , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[43]  Kate Saenko,et al.  Uncertainty-Aware Learning for Zero-Shot Semantic Segmentation , 2020, NeurIPS.

[44]  Koen E. A. van de Sande,et al.  Selective Search for Object Recognition , 2013, International Journal of Computer Vision.

[45]  Rama Chellappa,et al.  Zero-Shot Object Detection , 2018, ECCV.

[46]  Chen Change Loy,et al.  Learning to Prompt for Vision-Language Models , 2021, ArXiv.

[47]  Lina Yao,et al.  Zero-Shot Object Detection with Textual Descriptions , 2019, AAAI.

[48]  Quoc V. Le,et al.  Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision , 2021, ICML.

[49]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[50]  Seunghoon Hong,et al.  Decoupled Deep Neural Network for Semi-supervised Semantic Segmentation , 2015, NIPS.

[51]  Georg Heigold,et al.  An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2021, ICLR.

[52]  Nan Duan,et al.  Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training , 2019, AAAI.

[53]  Yang Liu,et al.  Transductive Unbiased Embedding for Zero-Shot Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[54]  Sebastian Ramos,et al.  The Cityscapes Dataset for Semantic Urban Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Ilya Sutskever,et al.  Learning Transferable Visual Models From Natural Language Supervision , 2021, ICML.