A Novel Transformer Based Semantic Segmentation Scheme for Fine-Resolution Remote Sensing Images

The fully convolutional network (FCN) with an encoder-decoder architecture has been the standard paradigm for semantic segmentation. This architecture uses an encoder to capture multilevel feature maps, which a decoder then incorporates into the final prediction. Because context is crucial for precise segmentation, tremendous effort has gone into extracting such information intelligently, including employing dilated (atrous) convolutions or inserting attention modules. However, these endeavors are all based on the FCN architecture with ResNet or other backbones, which in theory cannot fully exploit the context. By contrast, we introduce the Swin Transformer as the backbone to extract context information and design a novel decoder, the densely connected feature aggregation module (DCFAM), to restore the resolution and produce the segmentation map. Experimental results on two remotely sensed semantic segmentation datasets demonstrate the effectiveness of the proposed scheme.
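To make the notion of multilevel feature maps concrete, the following minimal sketch computes the feature-map shapes a hierarchical backbone would hand to a decoder, and the upsampling factor needed to restore each level to the input resolution. It assumes a Swin-T-style configuration (patch size 4, 96 base channels, channels doubling and resolution halving at each of four stages); the abstract does not state which variant is used. The function names are hypothetical, and the internals of the DCFAM are not modeled.

```python
from typing import List, Tuple


def hierarchical_feature_shapes(
    height: int,
    width: int,
    base_channels: int = 96,  # assumption: Swin-T-style base width
    num_stages: int = 4,      # assumption: four hierarchical stages
    first_stride: int = 4,    # assumption: 4x4 patch partition
) -> List[Tuple[int, int, int]]:
    """Return (channels, height, width) for each encoder stage."""
    shapes = []
    for stage in range(num_stages):
        stride = first_stride * 2 ** stage       # 4, 8, 16, 32
        channels = base_channels * 2 ** stage    # C, 2C, 4C, 8C
        shapes.append((channels, height // stride, width // stride))
    return shapes


def upsample_factors(shapes: List[Tuple[int, int, int]], height: int) -> List[int]:
    """Factor by which a decoder must upsample each level to reach input size."""
    return [height // h for _, h, _ in shapes]


if __name__ == "__main__":
    shapes = hierarchical_feature_shapes(512, 512)
    for level, (c, h, w) in enumerate(shapes, start=1):
        print(f"stage {level}: channels={c}, size={h}x{w}")
    print("upsample factors:", upsample_factors(shapes, 512))
```

For a 512 x 512 input under these assumptions, the coarsest level is 16 x 16, so a decoder has to recover a factor of 32 in spatial resolution while merging the finer levels along the way.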
