Learning Content-enhanced Mask Transformer for Domain Generalized Urban-Scene Segmentation

Domain-generalized urban-scene semantic segmentation (USSS) aims to learn semantic predictions that generalize across diverse urban-scene styles. Unlike generic domain-gap settings, USSS is distinctive in that the semantic categories are largely the same across urban scenes, while the styles can vary significantly with changes in urban landscape, weather, lighting, and other factors. Existing approaches typically rely on convolutional neural networks (CNNs) to learn the content of urban scenes. In this paper, we propose a Content-enhanced Mask TransFormer (CMFormer) for domain-generalized USSS. The main idea is to make the mask attention mechanism, the fundamental component of Transformer segmentation models, focus more strongly on content information. To achieve this, we introduce a novel content-enhanced mask attention mechanism. It learns mask queries from both the image feature and its down-sampled counterpart, since lower-resolution image features usually contain more robust content information and are less sensitive to style variations. The two sets of features are fused in a Transformer decoder and integrated into a multi-resolution content-enhanced mask attention learning scheme. Extensive experiments on several domain-generalized urban-scene segmentation datasets demonstrate that the proposed CMFormer significantly outperforms existing CNN-based methods for domain-generalized semantic segmentation, with improvements of up to 14.00% in mIoU (mean intersection over union). The source code for CMFormer will be made available at https://github.com/BiQiWHU/domain-generalized-urban-scene-segmentation.
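
To make the idea of attending over both an image feature and its down-sampled counterpart more concrete, the following is a minimal NumPy sketch and not the CMFormer implementation. Everything specific in it is an assumption made for illustration: the 2x2 average pooling used for down-sampling, the zero threshold on mask logits, the concatenate-then-project fusion with a hypothetical weight matrix `w_fuse`, the residual update, and all function names.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; entries equal to -inf receive weight 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def avg_pool2x(feat):
    # Down-sample an (H, W, C) map by 2x2 average pooling (H and W must be even).
    h, w, c = feat.shape
    return feat.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def masked_attention(queries, feat, mask_logits):
    # Each query attends only to pixels inside its current mask prediction.
    # queries: (N, C), feat: (H, W, C), mask_logits: (N, H, W)
    h, w, c = feat.shape
    kv = feat.reshape(h * w, c)
    scores = queries @ kv.T / np.sqrt(c)              # (N, H*W)
    fg = mask_logits.reshape(len(queries), h * w) > 0.0
    fg[~fg.any(axis=1)] = True                        # empty mask: attend everywhere
    scores = np.where(fg, scores, -np.inf)
    return softmax(scores) @ kv                       # (N, C)

def content_enhanced_mask_attention(queries, feat, mask_logits, w_fuse):
    # Attend at full resolution and at half resolution, then fuse (assumed scheme).
    out_full = masked_attention(queries, feat, mask_logits)
    feat_low = avg_pool2x(feat)
    mask_low = avg_pool2x(mask_logits.transpose(1, 2, 0)).transpose(2, 0, 1)
    out_low = masked_attention(queries, feat_low, mask_low)
    fused = np.concatenate([out_full, out_low], axis=1) @ w_fuse  # (N, 2C) @ (2C, C)
    return queries + fused                            # residual update of the queries

rng = np.random.default_rng(0)
q = rng.normal(size=(3, 8))        # 3 mask queries with 8 channels
f = rng.normal(size=(4, 6, 8))     # 4x6 feature map
m = rng.normal(size=(3, 4, 6))     # one mask logit map per query
w_fuse = rng.normal(size=(16, 8))  # hypothetical fusion weights
print(content_enhanced_mask_attention(q, f, m, w_fuse).shape)  # (3, 8)
```

In this sketch the low-resolution branch sees spatially averaged features, which is one simple way to reflect the abstract's argument that coarser features are less sensitive to style variations; the paper's actual down-sampling and fusion operators may differ.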
