Remote sensing image captioning via Variational Autoencoder and Reinforcement Learning

Abstract Image captioning, i.e., generating the natural semantic descriptions of given image, is an essential task for machines to understand the content of the image. Remote sensing image captioning is a part of the field. Most of the current remote sensing image captioning models suffered the overfitting problem and failed to utilize the semantic information in images. To this end, we propose a Variational Autoencoder and Reinforcement Learning based Two-stage Multi-task Learning Model (VRTMM) for the remote sensing image captioning task. In the first stage, we finetune the CNN jointly with the Variational Autoencoder. In the second stage, the Transformer generates the text description using both spatial and semantic features. Reinforcement Learning is then applied to enhance the quality of the generated sentences. Our model surpasses the previous state of the art records by a large margin on all seven scores on Remote Sensing Image Caption Dataset. The experiment result indicates our model is effective on remote sensing image captioning and achieves the new state-of-the-art result.

[1]  Alon Lavie,et al.  METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments , 2005, IEEvaluation@ACL.

[2]  Yejin Choi,et al.  Baby talk: Understanding and generating simple image descriptions , 2011, CVPR 2011.

[3]  Wanneng Wang,et al.  Expression characterization and actual function of the second pucBA in Rhodobacter sphaeroides. , 2009, Bioscience reports.

[4]  Yi Yang,et al.  Decoupled Novel Object Captioner , 2018, ACM Multimedia.

[5]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[6]  Svetlana Lazebnik,et al.  Improving Image-Sentence Embeddings Using Large Weakly Annotated Photo Collections , 2014, ECCV.

[7]  Lixiang Li,et al.  Captioning Transformer with Stacked Attention Modules , 2018 .

[8]  Lei Zhang,et al.  Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[9]  Xiangtao Zheng,et al.  Semantic Descriptions of High-Resolution Remote Sensing Images , 2019, IEEE Geoscience and Remote Sensing Letters.

[10]  Ramakant Nevatia,et al.  Automatic Concept Discovery from Parallel Text and Visual Corpora , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[11]  Tao Mei,et al.  Boosting Image Captioning with Attributes , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[12]  Marc'Aurelio Ranzato,et al.  Sequence Level Training with Recurrent Neural Networks , 2015, ICLR.

[13]  Yejin Choi,et al.  Composing Simple Image Descriptions using Web-scale N-grams , 2011, CoNLL.

[14]  Chunhua Shen,et al.  What Value Do Explicit High Level Concepts Have in Vision to Language Problems? , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Wei Xu,et al.  Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN) , 2014, ICLR.

[16]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[17]  Xiaoqiang Lu,et al.  Remote Sensing Image Scene Classification: Benchmark and State of the Art , 2017, Proceedings of the IEEE.

[18]  C. Lawrence Zitnick,et al.  CIDEr: Consensus-based image description evaluation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[20]  Vaibhava Goel,et al.  Self-Critical Sequence Training for Image Captioning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Chin-Yew Lin,et al.  ROUGE: A Package for Automatic Evaluation of Summaries , 2004, ACL 2004.

[22]  Carl Doersch,et al.  Tutorial on Variational Autoencoders , 2016, ArXiv.

[23]  Chen Xu A Remote Sensing Image Retrieval Model Based on Semantic Mining , 2009 .

[24]  Zhenwei Shi,et al.  Can a Machine Generate Humanlike Language Descriptions for a Remote Sensing Image? , 2017, IEEE Transactions on Geoscience and Remote Sensing.

[25]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[26]  Harish Bhaskar,et al.  Supervised remote sensing image segmentation using boosted convolutional neural networks , 2016, Knowl. Based Syst..

[27]  Peter Young,et al.  Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics , 2013, J. Artif. Intell. Res..

[28]  Ioannis B. Theocharis,et al.  GeneSIS: A GA-based fuzzy segmentation algorithm for remote sensing images , 2013, Knowl. Based Syst..

[29]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[30]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[31]  Bo Qu,et al.  Deep semantic understanding of high resolution remote sensing image , 2016, 2016 International Conference on Computer, Information and Telecommunication Systems (CITS).

[32]  Xiangtao Zheng,et al.  Exploring Models and Data for Remote Sensing Image Caption Generation , 2017, IEEE Transactions on Geoscience and Remote Sensing.

[33]  Cyrus Rashtchian,et al.  Every Picture Tells a Story: Generating Sentences from Images , 2010, ECCV.

[34]  Yi Yang,et al.  Revisiting EmbodiedQA: A Simple Baseline and Beyond , 2019, IEEE Transactions on Image Processing.

[35]  Richard Socher,et al.  Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Xin Wang,et al.  Description Generation for Remote Sensing Images Using Attribute Attention Mechanism , 2019, Remote. Sens..

[37]  Jiebo Luo,et al.  Image Captioning with Semantic Attention , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Qi Wu,et al.  Image Captioning and Visual Question Answering Based on Attributes and External Knowledge , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[39]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Neural Networks , 2013 .

[40]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[41]  Dumitru Erhan,et al.  Show and Tell: Lessons Learned from the 2015 MSCOCO Image Captioning Challenge , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[42]  Geoffrey E. Hinton,et al.  Layer Normalization , 2016, ArXiv.

[43]  Liangpei Zhang,et al.  Multi-feature probability topic scene classifier for high spatial resolution remote sensing imagery , 2014, 2014 IEEE Geoscience and Remote Sensing Symposium.

[44]  Yi Yang,et al.  Uncovering the Temporal Context for Video Question Answering , 2017, International Journal of Computer Vision.

[45]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[46]  Yi Yang,et al.  Cascaded Revision Network for Novel Object Captioning , 2019, IEEE Transactions on Circuits and Systems for Video Technology.