Bidirectional Guided Attention Network for 3-D Semantic Detection of Remote Sensing Images

Semantic segmentation and disparity estimation are at the research frontier of computer vision and remote sensing (RS). However, existing methods mostly treat the two problems separately or combine multiple models to solve them. Because information is not sufficiently shared and fused across the tasks, these methods still struggle with seasonal appearance differences in 3-D RS problems. In this article, we propose a novel multitask learning architecture for 3-D semantic detection that incorporates bottom-up and top-down visual attention mechanisms, named the bidirectional guided attention network (BGA-Net). BGA-Net consists of five modules: a unified backbone module (UBM), a bidirectional guided attention module (BGAM), a semantic segmentation module (SSM), a feature matching module (FMM), and a bidirectional fusion module (BFM). First, in UBM, a shared backbone extracts unified features and shares them with three branches (BGAM, SSM, and FMM). Then, the SSM and FMM branches estimate the segmentation and disparity maps, respectively, whereas the third branch (BGAM) shares global features to guide task-specific learning via an attention mechanism. Finally, BFM fuses the results of the two tasks to improve the final performance. Extensive experiments demonstrate that: 1) BGA-Net handles the two tasks simultaneously and can be trained end to end; 2) its modules exploit information from both tasks to share features and enhance scene understanding, making it robust to seasonal changes in RS images; and 3) BGA-Net offers notable superiority and greater flexibility and sets a new state of the art on the Urban Semantic 3-D (US3D) benchmark. Moreover, BGA-Net provides insights into the intelligent interpretation of RS images.
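The data flow described above (one shared backbone, an attention branch that guides two task heads, and a final fusion step) can be illustrated with a toy sketch. This is not the authors' implementation: every function name, the 1-D "image", the sigmoid gate, and the per-class averaging used for fusion are assumptions chosen only to keep the example small and runnable, and only the semantics-to-disparity direction of the fusion is shown.

```python
import math
from typing import List, Tuple

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def unified_backbone(pixels: Vector) -> Vector:
    # Stand-in for UBM (assumption): one mean-centered feature per pixel.
    # Assumes a non-empty input.
    mean = sum(pixels) / len(pixels)
    return [p - mean for p in pixels]


def guided_attention(features: Vector) -> Vector:
    # Stand-in for BGAM (assumption): a global context value (mean absolute
    # activation) scales each feature before a sigmoid gate in (0, 1).
    context = sum(abs(f) for f in features) / len(features)
    return [sigmoid(f * context) for f in features]


def segmentation_head(gates: Vector) -> List[int]:
    # Stand-in for SSM (assumption): binary label from the attention gate.
    return [1 if g > 0.5 else 0 for g in gates]


def disparity_head(features: Vector, gates: Vector) -> Vector:
    # Stand-in for FMM (assumption): gated feature magnitude as "disparity".
    return [abs(f) * g for f, g in zip(features, gates)]


def fuse(labels: List[int], disparity: Vector) -> Vector:
    # Stand-in for BFM (assumption): replace each disparity value with the
    # mean disparity of its semantic class.
    sums, counts = {}, {}
    for lab, d in zip(labels, disparity):
        sums[lab] = sums.get(lab, 0.0) + d
        counts[lab] = counts.get(lab, 0) + 1
    return [sums[lab] / counts[lab] for lab in labels]


def bga_forward(pixels: Vector) -> Tuple[List[int], Vector]:
    features = unified_backbone(pixels)           # shared by all branches
    gates = guided_attention(features)            # guidance branch
    labels = segmentation_head(gates)             # task 1
    disparity = disparity_head(features, gates)   # task 2
    return labels, fuse(labels, disparity)        # fused output


labels, refined = bga_forward([0.0, 2.0, 4.0, 6.0])
```

The point of the sketch is the wiring: both heads consume the same backbone features and the same attention gates, so a single forward pass yields both outputs, which is what makes joint end-to-end training possible in the real network.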
