Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images

Semantic segmentation of very fine resolution (VFR) urban scene images plays a significant role in application scenarios such as autonomous driving, land cover classification, and urban planning. However, the tremendous detail contained in VFR images, especially the considerable variation in the scale and appearance of objects, severely limits the potential of existing deep learning approaches. Addressing these issues is a promising research direction in the remote sensing community and paves the way for scene-level landscape pattern analysis and decision making. In this paper, we propose a Bilateral Awareness Network (BANet), which contains a dependency path and a texture path to capture the long-range relationships and fine-grained details in VFR images. Specifically, the dependency path is built on ResT, a novel Transformer backbone with memory-efficient multi-head self-attention, while the texture path is built on stacked convolution operations. In addition, a feature aggregation module based on the linear attention mechanism is designed to effectively fuse the dependency features and texture features. Extensive experiments on three large-scale urban scene image segmentation datasets, i.e., the ISPRS Vaihingen dataset, the ISPRS Potsdam dataset, and the UAVid dataset, demonstrate the effectiveness of BANet. In particular, BANet achieves a 64.6% mIoU on the UAVid dataset.
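
For readers unfamiliar with the reported metric, mean intersection-over-union (mIoU) averages, over classes, the ratio of correctly labelled pixels to the union of predicted and true pixels for that class. The following is a minimal sketch in plain Python, not the evaluation code used for the experiments: the function name `mean_iou`, the flattened label lists, and the choice to skip classes absent from both prediction and ground truth are all assumptions made for illustration.

```python
def mean_iou(y_true, y_pred, num_classes):
    """Mean IoU over classes that appear in the truth or the prediction.

    Illustrative helper (hypothetical name); labels are integers in
    [0, num_classes) given as flat sequences of equal length.
    """
    # Confusion matrix: rows are true classes, columns are predicted classes.
    conf = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        conf[t][p] += 1

    ious = []
    for c in range(num_classes):
        tp = conf[c][c]                                         # true positives
        fn = sum(conf[c]) - tp                                  # missed pixels of class c
        fp = sum(conf[r][c] for r in range(num_classes)) - tp   # wrongly labelled as c
        union = tp + fp + fn
        if union > 0:  # assumption: ignore classes absent from both maps
            ious.append(tp / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six pixels, three classes.
truth = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
score = mean_iou(truth, pred, num_classes=3)
print(f"mIoU = {score:.3f}")  # per-class IoUs are 1/3, 2/3 and 1/2
```

In this toy case the three per-class IoUs average to 0.5. Benchmark protocols may differ in how they treat ignored or absent classes, so the sketch should not be read as reproducing the reported 64.6% figure.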
