WeakTr: Exploring Plain Vision Transformer for Weakly-supervised Semantic Segmentation

This paper explores the properties of the plain Vision Transformer (ViT) for weakly-supervised semantic segmentation (WSSS). The class activation map (CAM) is critical both for understanding a classification network and for bootstrapping WSSS. We observe that different attention heads of ViT focus on different image areas. We therefore propose a novel weight-based method that estimates the importance of each attention head end-to-end and adaptively fuses the self-attention maps into high-quality CAMs that tend to cover objects more completely. In addition, we propose a ViT-based gradient clipping decoder for online retraining with the CAM results to complete the WSSS task. We name this plain Transformer-based weakly-supervised learning framework WeakTr. It achieves state-of-the-art WSSS performance on standard benchmarks: 78.4% mIoU on the val set of PASCAL VOC 2012 and 50.3% mIoU on the val set of COCO 2014. Code is available at https://github.com/hustvl/WeakTr.
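
The idea of weighting attention heads and fusing their self-attention maps to refine a CAM can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the linked repository for that). The function names `fuse_attention_heads` and `refine_cam`, the tensor shapes, and the hand-set `head_scores` are all assumptions made for illustration; in the paper the head importance is estimated end-to-end inside the network rather than supplied by hand.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_attention_heads(attn, head_scores):
    """Fuse per-head attention maps with normalized importance weights.

    attn:        (H, N, N) patch-to-patch self-attention, one map per head.
    head_scores: (H,) unnormalized importance scores (hypothetical input;
                 a real model would learn these).
    Returns the fused (N, N) map and the (H,) weights that sum to 1.
    """
    weights = softmax(head_scores)
    fused = np.einsum("h,hij->ij", weights, attn)  # weighted sum over heads
    return fused, weights


def refine_cam(coarse_cam, fused_attn):
    """Propagate coarse CAM scores along the fused attention.

    coarse_cam: (C, N) per-class activation over N patches.
    fused_attn: (N, N) fused attention; row i says where patch i attends.
    Returns refined (C, N) where refined[c, i] = sum_j fused[i, j] * cam[c, j].
    """
    return coarse_cam @ fused_attn.T


# Toy example: 3 heads, 4 patches, 2 classes (all values are synthetic).
rng = np.random.default_rng(0)
attn = softmax(rng.normal(size=(3, 4, 4)), axis=-1)  # rows sum to 1
head_scores = np.array([0.0, 1.0, 2.0])              # assumed, not learned
coarse_cam = rng.random(size=(2, 4))

fused, weights = fuse_attention_heads(attn, head_scores)
refined = refine_cam(coarse_cam, fused)
print("head weights:", np.round(weights, 3))
print("refined CAM shape:", refined.shape)
```

Because the weights are a softmax, the fused map is a convex combination of the per-head maps, so each of its rows still sums to 1 and the refined CAM stays within the range of the coarse CAM values.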
