Denoising-Based Multiscale Feature Fusion for Remote Sensing Image Captioning

Benefiting from deep learning technology, it becomes achievable to generate captions for remote sensing images and great progress has been made in the recent years. However, the large scale variation of remote sensing images, which would lead to errors or omissions in feature extraction, still limits the further improvement of caption quality. To address this problem, we propose a denoising based multi-scale feature fusion (DMSFF) mechanism for remote sensing image captioning in this paper. The proposed DMSFF mechanism aggregates multiscale features with the denoising operation at the stage of visual feature extraction. It can help the encoder-decoder framework, which is widely used in image captioning, to obtain the denoising multi-scale feature representation. In experiments, we apply the proposed DMSFF in the encoder-decoder framework and perform the comparative experiments on two public remote sensing image captioning data sets including UCM-captions and Sydneycaptions. The experimental results demonstrate the effectiveness of our method.

[1]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[2]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Xuelong Li,et al.  Scene Classification With Recurrent Attention of VHR Remote Sensing Images , 2019, IEEE Transactions on Geoscience and Remote Sensing.

[4]  Bo Qu,et al.  Deep semantic understanding of high resolution remote sensing image , 2016, 2016 International Conference on Computer, Information and Telecommunication Systems (CITS).

[5]  Xuelong Li,et al.  Latent Semantic Minimal Hashing for Image Retrieval , 2017, IEEE Transactions on Image Processing.

[6]  Tat-Seng Chua,et al.  SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Zhenwei Shi,et al.  Can a Machine Generate Humanlike Language Descriptions for a Remote Sensing Image? , 2017, IEEE Transactions on Geoscience and Remote Sensing.

[8]  Chin-Yew Lin,et al.  ROUGE: A Package for Automatic Evaluation of Summaries , 2004, ACL 2004.

[9]  Xiangtao Zheng,et al.  Exploring Models and Data for Remote Sensing Image Caption Generation , 2017, IEEE Transactions on Geoscience and Remote Sensing.

[10]  Jiebo Luo,et al.  Image Captioning with Semantic Attention , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[12]  Xuelong Li,et al.  Multi-Scale Cropping Mechanism for Remote Sensing Image Captioning , 2019, IGARSS 2019 - 2019 IEEE International Geoscience and Remote Sensing Symposium.

[13]  Xiangtao Zheng,et al.  Sound Active Attention Framework for Remote Sensing Image Captioning , 2020, IEEE Transactions on Geoscience and Remote Sensing.

[14]  DeLiang Wang,et al.  Remote Sensing Image Segmentation by Combining Spectral and Texture Features , 2014, IEEE Transactions on Geoscience and Remote Sensing.

[15]  C. Lawrence Zitnick,et al.  CIDEr: Consensus-based image description evaluation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Wei Liu,et al.  Learning to Guide Decoding for Image Captioning , 2018, AAAI.

[17]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[18]  Enhua Wu,et al.  Squeeze-and-Excitation Networks , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[19]  Menglong Yan,et al.  VAA: Visual Aligning Attention Model for Remote Sensing Image Captioning , 2019, IEEE Access.

[20]  Zhe Gan,et al.  Adaptive Feature Abstraction for Translating Video to Text , 2018, AAAI.

[21]  Xuelong Li,et al.  Locality and Structure Regularized Low Rank Representation for Hyperspectral Image Classification , 2019, IEEE Transactions on Geoscience and Remote Sensing.

[22]  Samy Bengio,et al.  Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[24]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.