Paper Information - 3rd Place Solution for PVUW2023 VSS Track: A Large Model for Semantic Segmentation on VSPW

In this paper, we introduce our 3rd place solution for the PVUW2023 VSS track. Semantic segmentation is a fundamental task in computer vision with numerous real-world applications. We explored various image-level visual backbones and segmentation heads to tackle the problem of video semantic segmentation. In our experiments, the combination of InternImage-H as the backbone and Mask2Former as the segmentation head achieves the best performance. In addition, we explore two post-processing methods: CascadePSP and the Segment Anything Model (SAM). Ultimately, our approach obtains 62.60% and 64.84% mIoU on the VSPW test set1 and final test set, respectively, securing third place in the PVUW2023 VSS track.
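The results above are reported in mIoU (mean intersection over union). As a rough illustration of how this metric is commonly computed, the sketch below accumulates a confusion matrix over flattened label sequences and averages the per-class IoU. The function names, the ignore label of 255, and the choice to skip classes that never occur are assumptions made for this example; the official VSPW evaluation script may differ in these details.

```python
from typing import List, Sequence


def confusion_matrix(pred: Sequence[int], target: Sequence[int],
                     num_classes: int, ignore_index: int = 255) -> List[List[int]]:
    """Count (ground truth, prediction) pairs; rows are ground-truth classes."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        if t == ignore_index:  # assumed "void" label, excluded from scoring
            continue
        cm[t][p] += 1
    return cm


def mean_iou(cm: List[List[int]]) -> float:
    """Average IoU = TP / (TP + FP + FN) over classes that appear at all."""
    ious = []
    for c in range(len(cm)):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                   # ground truth c, predicted otherwise
        fp = sum(row[c] for row in cm) - tp    # predicted c, ground truth otherwise
        denom = tp + fp + fn
        if denom == 0:                         # class absent from both pred and target
            continue
        ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else float("nan")


# Toy example: six pixels, three classes, last pixel carries the ignore label.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 255]
cm = confusion_matrix(pred, target, num_classes=3)
score = mean_iou(cm)
```

In the toy example the per-class IoUs are 1/2, 2/3, and 1, so `score` is 13/18. For a dataset-level score, the confusion matrix would typically be accumulated over all frames before calling `mean_iou`, rather than averaging per-frame scores.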
