MMCAN: Multi-Modal Cross-Attention Network for Free-Space Detection with Uncalibrated Hyperspectral Sensors

Free-space detection plays a pivotal role in autonomous vehicle applications, and state-of-the-art algorithms for it are typically based on semantic segmentation of road areas. Recently, hyperspectral images have proven to be useful supplementary information in multi-modal segmentation: they add texture detail to the RGB representations and thereby improve performance in road segmentation tasks. Existing multi-modal segmentation methods assume that all inputs are well aligned, which reduces the problem to fusing feature maps from different modalities. However, there are cases in which the sensors cannot be well calibrated. In this paper, we propose a novel network, the multi-modal cross-attention network (MMCAN), for multi-modal free-space detection with uncalibrated hyperspectral sensors. We first introduce a cross-modality transformer that uses hyperspectral data to enhance RGB features, and then aggregate these representations alternately over multiple stages. This transformer promotes the spread and fusion of information between modalities that cannot be aligned at the pixel level. Furthermore, we propose a triplet gate fusion strategy, which increases the proportion of RGB in the multi-spectral fusion process while maintaining the specificity of each modality. Experimental results on a multi-spectral dataset demonstrate that our MMCAN model achieves state-of-the-art performance. The method can be applied directly to images taken in the field without complex preprocessing. Our future goal is to adapt the algorithm to multi-object segmentation and to generalize it to other multi-modal combinations.
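
As a rough illustration of why cross-attention does not need pixel-level alignment, the following minimal sketch lets RGB tokens query hyperspectral tokens with standard scaled dot-product attention. It is not the MMCAN architecture: the function name, token counts, feature width, random projection matrices, and residual connection are all assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(rgb_tokens, hsi_tokens, w_q, w_k, w_v):
    """Hypothetical single-head cross-attention: RGB queries, hyperspectral keys/values."""
    q = rgb_tokens @ w_q                      # (N_rgb, d)
    k = hsi_tokens @ w_k                      # (N_hsi, d)
    v = hsi_tokens @ w_v                      # (N_hsi, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (N_rgb, N_hsi)
    weights = softmax(scores, axis=-1)        # each RGB token attends over all HSI tokens
    # Residual connection (an assumption): RGB features are enhanced, not replaced.
    return rgb_tokens + weights @ v, weights

rng = np.random.default_rng(0)
d = 16                                        # assumed shared feature width
rgb_tokens = rng.normal(size=(64, d))         # e.g. an 8x8 RGB feature map, flattened
hsi_tokens = rng.normal(size=(40, d))         # a different token count: no 1:1 pixel mapping
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

enhanced, weights = cross_attention(rgb_tokens, hsi_tokens, w_q, w_k, w_v)
print(enhanced.shape, weights.shape)
```

Because each RGB token forms a weighted sum over every hyperspectral token, the two feature maps may differ in resolution and viewpoint; the attention weights, rather than a fixed pixel correspondence, decide which hyperspectral features contribute to each RGB location.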
