TransFGU: A Top-down Approach to Fine-Grained Unsupervised Semantic Segmentation

Unsupervised semantic segmentation aims to obtain high-level semantic representations from low-level visual features without manual annotations. Most existing methods are bottom-up: they group pixels into regions based on visual cues or predefined rules. As a result, they struggle to produce fine-grained semantic segmentation in complicated scenes that contain multiple objects, some of which share a similar visual appearance. In contrast, we propose the first top-down unsupervised semantic segmentation framework for fine-grained segmentation in extremely complicated scenarios. First, we obtain rich, high-level, structured semantic concept information from large-scale vision data in a self-supervised manner, and use it as a prior to discover the potential semantic categories present in target datasets. Second, the discovered high-level semantic categories are mapped to low-level pixel features by calculating the class activation map (CAM) with respect to a given discovered semantic representation. Finally, the obtained CAMs serve as pseudo labels to train the segmentation module and produce the final semantic segmentation. Experimental results on multiple semantic segmentation benchmarks show that our top-down unsupervised segmentation is robust on both object-centric and scene-centric datasets under different semantic granularity levels, and outperforms all current state-of-the-art bottom-up methods. Our code is available at https://github.com/damo-cv/TransFGU.
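
The three top-down steps can be illustrated with a toy NumPy sketch. It is not the released TransFGU code, and it makes several simplifying assumptions: random vectors stand in for self-supervised features, plain k-means stands in for category discovery, and cosine similarity between patch features and a discovered concept stands in for the CAM computation. The function names `kmeans`, `activation_maps`, and `pseudo_labels` are hypothetical and chosen only for this example.

```python
import numpy as np


def kmeans(features, k, iters=20):
    """Plain k-means over (n, d) features; returns (k, d) centroids.

    The centroids play the role of discovered semantic concepts.
    """
    # Farthest-first initialisation: deterministic, spreads the initial centroids.
    chosen = [features[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(features - c, axis=1) for c in chosen], axis=0)
        chosen.append(features[d.argmax()])
    centroids = np.stack(chosen).astype(float)

    for _ in range(iters):
        # Assign every feature to its nearest centroid.
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for c in range(k):
            members = features[assign == c]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[c] = members.mean(axis=0)
    return centroids


def activation_maps(patch_features, concepts, eps=1e-8):
    """Cosine similarity of each patch to each concept: (h, w, d) -> (k, h, w).

    Assumption: this similarity map is a simplified stand-in for a CAM.
    """
    p = patch_features / (np.linalg.norm(patch_features, axis=-1, keepdims=True) + eps)
    c = concepts / (np.linalg.norm(concepts, axis=-1, keepdims=True) + eps)
    return np.einsum("hwd,kd->khw", p, c)


def pseudo_labels(maps):
    """Per-patch pseudo label: index of the most activated concept."""
    return maps.argmax(axis=0)


# Step 1: discover concepts from pooled (image-level) toy features.
rng = np.random.default_rng(0)
base = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
pooled = np.concatenate([b + 0.05 * rng.standard_normal((50, 3)) for b in base])
concepts = kmeans(pooled, k=2)

# Step 2: map concepts back to a 4x4 grid of patch features.
patches = np.zeros((4, 4, 3))
patches[:, :2] = base[0]  # left half resembles the first concept
patches[:, 2:] = base[1]  # right half resembles the second concept
maps = activation_maps(patches, concepts)

# Step 3: turn the activation maps into pseudo labels for a segmentation head.
labels = pseudo_labels(maps)
print(labels)
```

In this toy setting the left and right halves of the grid receive different labels. In a real pipeline the pseudo labels would then supervise a segmentation module, which this sketch leaves out.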
