Paper Information - Vision Transformers are Good Mask Auto-Labelers

Vision Transformers are Good Mask Auto-Labelers

We propose Mask Auto-Labeler (MAL), a high-quality Transformer-based mask auto-labeling framework for instance segmentation that uses only box annotations. MAL takes box-cropped images as inputs and conditionally generates their mask pseudo-labels. We show that Vision Transformers are good mask auto-labelers. Our method significantly reduces the gap between auto-labeling and human annotation in mask quality. Instance segmentation models trained on MAL-generated masks nearly match their fully supervised counterparts, retaining up to 97.4% of their performance. The best model achieves 44.1% mAP on COCO instance segmentation (test-dev 2017), outperforming state-of-the-art box-supervised methods by significant margins. Qualitative results indicate that masks produced by MAL are, in some cases, even better than human annotations.
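
To make the "box-cropped images as inputs" step concrete, here is a minimal sketch of turning a box annotation into a crop window. It is not the authors' implementation: the function name `crop_window`, the `pad_ratio` parameter, and the `(x0, y0, x1, y1)` box layout are assumptions chosen for illustration, and the abstract does not say whether MAL pads its crops.

```python
import math

def crop_window(image_size, box, pad_ratio=0.0):
    """Return integer (left, top, right, bottom) crop bounds for a box.

    image_size: (height, width) of the full image.
    box: (x0, y0, x1, y1) in pixels (assumed layout, not from the paper).
    pad_ratio: hypothetical knob that widens the box by this fraction of
        its width/height on each side before clamping to the image.
    """
    height, width = image_size
    x0, y0, x1, y1 = box
    pad_x = (x1 - x0) * pad_ratio
    pad_y = (y1 - y0) * pad_ratio
    # Round outward so the crop always contains the whole box,
    # then clamp so the window never leaves the image.
    left = max(0, math.floor(x0 - pad_x))
    top = max(0, math.floor(y0 - pad_y))
    right = min(width, math.ceil(x1 + pad_x))
    bottom = min(height, math.ceil(y1 + pad_y))
    return left, top, right, bottom

def crop_rows(image, window):
    """Crop an image stored as a list of pixel rows to the given window."""
    left, top, right, bottom = window
    return [row[left:right] for row in image[top:bottom]]

# A 100x200 dummy image and one box annotation.
image = [[0] * 200 for _ in range(100)]
window = crop_window((100, 200), (50, 20, 150, 80))
crop = crop_rows(image, window)
print(window, len(crop), len(crop[0]))
```

With an array library, the same window would typically be applied as a slice such as `image[top:bottom, left:right]`; the resulting crop is what a mask auto-labeler of this kind would consume, one crop per box.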
