Effective Fusion of Multi-Modal Data with Group Convolutions for Semantic Segmentation of Aerial Imagery

In this paper, we perform semantic segmentation of aerial imagery based on an effective fusion of multi-modal data. The multi-modal data comprise a true orthophoto and the corresponding normalized Digital Surface Model (nDSM), which are stacked together before being fed into a Convolutional Neural Network (CNN). Although the two modalities are fused at this early stage, their features are first learned independently with group convolutions and are then fused at multiple scales with standard convolutions. The multi-scale fusion of multi-modal features is therefore completed within a single-branch convolutional network, which reduces the computational cost while, as our experiments show, still yielding promising results.
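To illustrate the mechanism described above, here is a minimal NumPy sketch (not the authors' implementation) of a 1x1 group convolution: the stacked input channels are split into disjoint groups, so each output group only sees its own modality's channels until a later standard convolution mixes them. The channel counts and the replication of the nDSM into a second channel are illustrative assumptions.

```python
import numpy as np

def group_conv1x1(x, weights, groups):
    """1x1 group convolution on a (C_in, H, W) tensor.

    Channels are split into `groups` disjoint sets; each output group is
    computed from its own input group only, so no cross-modal mixing occurs.
    `weights` is a list of (C_out_g, C_in_g) arrays, one per group.
    """
    c_in = x.shape[0]
    assert c_in % groups == 0, "channel count must be divisible by groups"
    c_in_g = c_in // groups
    outs = []
    for g in range(groups):
        xg = x[g * c_in_g:(g + 1) * c_in_g]   # this group's input channels
        wg = weights[g]                        # (c_out_g, c_in_g)
        # 1x1 conv = per-pixel linear map over this group's channels
        outs.append(np.tensordot(wg, xg, axes=([1], [0])))
    return np.concatenate(outs, axis=0)

# Hypothetical 4-channel stack: 3 orthophoto bands + 1 nDSM channel padded
# with a duplicate so the two groups (image / height) have equal width.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))
weights = [rng.standard_normal((8, 2)) for _ in range(2)]
y = group_conv1x1(x, weights, groups=2)
print(y.shape)  # (16, 8, 8): 8 image-only + 8 height-only feature maps
```

A subsequent standard convolution over all 16 output channels would then fuse the modality-specific features, which is what distinguishes this single-branch design from two-stream fusion networks such as FuseNet.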
