DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation

As acquiring pixel-wise annotations of real-world images for semantic segmentation is a costly process, a model can instead be trained with more accessible synthetic data and adapted to real images without requiring their annotations. This process is studied in unsupervised domain adaptation (UDA). Even though a large number of methods propose new adaptation strategies, they are mostly based on outdated network architectures. As the influence of recent network architectures has not been systematically studied, we first benchmark different network architectures for UDA and then propose a novel UDA method, DAFormer, based on the benchmark results. The DAFormer network consists of a Transformer encoder and a multi-level contextaware feature fusion decoder. It is enabled by three simple but crucial training strategies to stabilize the training and to avoid overfitting DAFormer to the source domain: While the Rare Class Sampling on the source domain improves the quality of pseudo-labels by mitigating the confirmation bias of self-training towards common classes, the Thing-Class ImageNet Feature Distance and a learning rate warmup promote feature transfer from ImageNet pretraining. DAFormer significantly improves the state-of-the-art performance by 10.8 mIoU for GTA→Cityscapes and 5.4 mIoU for Synthia→Cityscapes and enables learning even difficult classes such as train, bus, and truck well. The implementation is available at https://github.com/ lhoyer/DAFormer.

[1]  Shanghang Zhang,et al.  Instance Adaptive Self-Training for Unsupervised Domain Adaptation , 2020, ECCV.

[2]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Luc Van Gool,et al.  Model Adaptation with Synthetic and Real Data for Semantic Dense Foggy Scene Understanding , 2018, ECCV.

[4]  Judy Hoffman,et al.  SENTRY: Selective Entropy Optimization via Committee Consistency for Unsupervised Domain Adaptation , 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[5]  Anima Anandkumar,et al.  SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers , 2021, NeurIPS.

[6]  Trevor Darrell,et al.  Fully Convolutional Networks for Semantic Segmentation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[7]  Luc Van Gool,et al.  Three Ways to Improve Semantic Segmentation with Self-Supervised Depth Estimation , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Yizhou Yu,et al.  Coarse-to-Fine Domain Adaptive Semantic Segmentation with Photometric Alignment and Category-Center Regularization , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Xuequan Lu,et al.  Context-Aware Mixup for Domain Adaptive Semantic Segmentation , 2021, ArXiv.

[10]  Ming-Hsuan Yang,et al.  Learning to Adapt Structured Output Space for Semantic Segmentation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[11]  Changick Kim,et al.  Self-Ensembling With GAN-Based Data Augmentation for Domain Adaptation in Semantic Segmentation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[12]  Antonio M. López,et al.  The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Stefano Soatto,et al.  FDA: Fourier Domain Adaptation for Semantic Segmentation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Wei Zhang,et al.  Classes Matter: A Fine-grained Adversarial Approach to Cross-domain Semantic Segmentation , 2020, ECCV.

[15]  Xiaofeng Liu,et al.  Confidence Regularized Self-Training , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[16]  Roberto Cipolla,et al.  SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17]  David Berthelot,et al.  FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence , 2020, NeurIPS.

[18]  George Papandreou,et al.  Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation , 2018, ECCV.

[19]  Yi-Hsuan Tsai,et al.  Domain Adaptation for Structured Output via Discriminative Patch Representations , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[20]  Hyeran Byun,et al.  Learning Texture Invariant Representation for Domain Adaptation of Semantic Segmentation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Andreas Veit,et al.  Understanding Robustness of Transformers for Image Classification , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[22]  Yong Wang,et al.  Prototypical Pseudo Label Denoising and Target Structure Learning for Domain Adaptive Semantic Segmentation , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Yuning Jiang,et al.  Unified Perceptual Parsing for Scene Understanding , 2018, ECCV.

[24]  Iasonas Kokkinos,et al.  DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[25]  Jie Li,et al.  SPIGAN: Privileged Adversarial Learning from Simulation , 2018, ICLR.

[26]  Yang Zou,et al.  Domain Adaptation for Semantic Segmentation via Class-Balanced Self-Training , 2018, ArXiv.

[27]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[28]  Iasonas Kokkinos,et al.  Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs , 2014, ICLR.

[29]  Luc Van Gool,et al.  Domain Adaptive Semantic Segmentation with Self-Supervised Depth Estimation , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[30]  Patrick Pérez,et al.  DADA: Depth-Aware Domain Adaptation in Semantic Segmentation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[31]  Dacheng Tao,et al.  Category Anchor-Guided Unsupervised Domain Adaptation for Semantic Segmentation , 2019, NeurIPS.

[32]  Ross B. Girshick,et al.  Focal Loss for Dense Object Detection , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[33]  Pin-Yu Chen,et al.  Vision Transformers are Robust Learners , 2021, AAAI.

[34]  Xilin Chen,et al.  Object-Contextual Representations for Semantic Segmentation , 2019, ECCV.

[35]  Carsten Rother,et al.  Benchmarking the Robustness of Semantic Segmentation Models with Respect to Common Corruptions , 2020, Int. J. Comput. Vis..

[36]  Jinjun Xiong,et al.  Differential Treatment for Stuff and Things: A Simple Unsupervised Domain Adaptation Method for Semantic Segmentation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Trevor Darrell,et al.  FCNs in the Wild: Pixel-level Adversarial and Constraint-based Adaptation , 2016, ArXiv.

[38]  Patrick Pérez,et al.  ADVENT: Adversarial Entropy Minimization for Domain Adaptation in Semantic Segmentation , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  François Laviolette,et al.  Domain-Adversarial Training of Neural Networks , 2015, J. Mach. Learn. Res..

[40]  Yi Yang,et al.  Taking a Closer Look at Domain Shift: Category-Level Adversaries for Semantics Consistent Domain Adaptation , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Vittorio Ferrari,et al.  COCO-Stuff: Thing and Stuff Classes in Context , 2016, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[42]  Taesung Park,et al.  CyCADA: Cycle-Consistent Adversarial Domain Adaptation , 2017, ICML.

[43]  Anna Khoreva,et al.  Grid Saliency for Context Explanations of Semantic Segmentation , 2019, NeurIPS.

[44]  Jihan Yang,et al.  Re-distributing Biased Pseudo Labels for Semi-supervised Semantic Segmentation: A Baseline Investigation , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[45]  D. Song,et al.  The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization , 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[46]  Stephen Lin,et al.  Swin Transformer: Hierarchical Vision Transformer using Shifted Windows , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[47]  Jinwoo Shin,et al.  M2m: Imbalanced Classification via Major-to-Minor Translation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Kaiming He,et al.  Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour , 2017, ArXiv.

[49]  Karan Sapra,et al.  Hierarchical Multi-Scale Attention for Semantic Segmentation , 2020, ArXiv.

[50]  Joan Bruna,et al.  Intriguing properties of neural networks , 2013, ICLR.

[51]  Derek Hoiem,et al.  Learning without Forgetting , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[52]  Lennart Svensson,et al.  ClassMix: Segmentation-Based Data Augmentation for Semi-Supervised Learning , 2020, 2021 IEEE Winter Conference on Applications of Computer Vision (WACV).

[53]  Jun Fu,et al.  Dual Attention Network for Scene Segmentation , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[54]  Liyuan Liu,et al.  On the Variance of the Adaptive Learning Rate and Beyond , 2019, ICLR.

[55]  Luc Van Gool,et al.  ACDC: The Adverse Conditions Dataset with Correspondences for Semantic Driving Scene Understanding , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[56]  Dong Liu,et al.  High-Resolution Representations for Labeling Pixels and Regions , 2019, ArXiv.

[57]  François Chollet,et al.  Xception: Deep Learning with Depthwise Separable Convolutions , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[58]  Martial Hebert,et al.  Learning to Model the Tail , 2017, NIPS.

[59]  Fahad Shahbaz Khan,et al.  Intriguing Properties of Vision Transformers , 2021, NeurIPS.

[60]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[61]  Georg Heigold,et al.  An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2021, ICLR.

[62]  Sebastian Ramos,et al.  The Cityscapes Dataset for Semantic Urban Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[63]  Xilin Chen,et al.  OCNet: Object Context for Semantic Segmentation , 2021, International Journal of Computer Vision.

[64]  Xiaojuan Qi,et al.  An Adversarial Perturbation Oriented Domain Adaptation Approach for Semantic Segmentation , 2019, AAAI.

[65]  Timo Aila,et al.  Semi-supervised semantic segmentation needs strong, varied perturbations , 2019, BMVC.

[66]  Luc Van Gool,et al.  DLOW: Domain Flow and Applications , 2021, International Journal of Computer Vision.

[67]  Kaiming He,et al.  Focal Loss for Dense Object Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[68]  Geoffrey E. Hinton,et al.  Distilling the Knowledge in a Neural Network , 2015, ArXiv.

[69]  Frank Hutter,et al.  Decoupled Weight Decay Regularization , 2017, ICLR.

[70]  Lennart Svensson,et al.  DACS: Domain Adaptation via Cross-domain Mixed Sampling , 2020, ArXiv.

[71]  Dong-Hyun Lee,et al.  Pseudo-Label : The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks , 2013 .

[72]  Luc Van Gool,et al.  Learning Semantic Segmentation From Synthetic Data: A Geometrically Guided Input-Output Adaptation Approach , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[73]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[74]  Matthias Bethge,et al.  ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness , 2018, ICLR.

[75]  Chuchu Han,et al.  Deep Representation Learning on Long-Tailed Data: A Learnable Embedding Augmentation Perspective , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[76]  Fengmao Lv,et al.  Cross-Domain Semantic Segmentation via Domain-Invariant Interactive Relation Transfer , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[77]  Luc Van Gool,et al.  ROAD: Reality Oriented Adaptation for Semantic Segmentation of Urban Scenes , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[78]  Zhedong Zheng,et al.  Rectifying Pseudo Label Learning via Uncertainty Estimation for Domain Adaptive Semantic Segmentation , 2021, Int. J. Comput. Vis..

[79]  Xilin Chen,et al.  Interlaced Sparse Self-Attention for Semantic Segmentation , 2019, ArXiv.

[80]  Nikita Araslanov,et al.  Self-supervised Augmentation Consistency for Adapting Semantic Segmentation , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[81]  Haibo He,et al.  ADASYN: Adaptive synthetic sampling approach for imbalanced learning , 2008, 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence).

[82]  Abhinav Gupta,et al.  Training Region-Based Object Detectors with Online Hard Example Mining , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[83]  Thomas G. Dietterich,et al.  Benchmarking Neural Network Robustness to Common Corruptions and Perturbations , 2018, ICLR.

[84]  Matthieu Cord,et al.  Training data-efficient image transformers & distillation through attention , 2020, ICML.

[85]  Xiaogang Wang,et al.  Pyramid Scene Parsing Network , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[86]  Vladlen Koltun,et al.  Multi-Scale Context Aggregation by Dilated Convolutions , 2015, ICLR.

[87]  Alan Yuille,et al.  CReST: A Class-Rebalancing Self-Training Framework for Imbalanced Semi-Supervised Learning , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[88]  Abhinav Gupta,et al.  Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[89]  Geoffrey E. Hinton,et al.  Visualizing Data using t-SNE , 2008 .

[90]  Vladlen Koltun,et al.  Playing for Data: Ground Truth from Computer Games , 2016, ECCV.

[91]  Luke Melas-Kyriazi,et al.  PixMatch: Unsupervised Domain Adaptation via Pixelwise Consistency Training , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[92]  Ling Shao,et al.  Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions , 2021, ArXiv.

[93]  Tao Xiang,et al.  Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[94]  Harri Valpola,et al.  Weight-averaged consistency targets improve semi-supervised deep learning results , 2017, ArXiv.

[95]  Jingang Tan,et al.  SSF-DAN: Separated Semantic Feature Based Domain Adaptation Network for Semantic Segmentation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[96]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[97]  Nuno Vasconcelos,et al.  Bidirectional Learning for Domain Adaptation of Semantic Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).