Exploring Models and Data for Remote Sensing Image Caption Generation

Inspired by recent development of artificial satellite, remote sensing images have attracted extensive attention. Recently, notable progress has been made in scene classification and target detection. However, it is still not clear how to describe the remote sensing image content with accurate and concise sentences. In this paper, we investigate to describe the remote sensing images with accurate and flexible sentences. First, some annotated instructions are presented to better describe the remote sensing images considering the special characteristics of remote sensing images. Second, in order to exhaustively exploit the contents of remote sensing images, a large-scale aerial image data set is constructed for remote sensing image caption. Finally, a comprehensive review is presented on the proposed data set to fully advance the task of remote sensing caption. Extensive experiments on the proposed data set demonstrate that the content of the remote sensing image can be completely described by generating language descriptions. The data set is available at https://github.com/201528014227051/RSICD_optimal.

[1]  Alon Lavie,et al.  Meteor Universal: Language Specific Translation Evaluation for Any Target Language , 2014, WMT@ACL.

[2]  Jefersson Alex dos Santos,et al.  Do deep features generalize from everyday objects to remote sensing and aerial scenes domains? , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[3]  Xiangtao Zheng,et al.  A target detection method for hyperspectral image based on mixture noise model , 2016, Neurocomputing.

[4]  Jon Atli Benediktsson,et al.  Nonlinear Multiple Kernel Learning With Multiple-Structure-Element Extended Morphological Profiles for Hyperspectral Image Classification , 2016, IEEE Transactions on Geoscience and Remote Sensing.

[5]  Xuelong Li,et al.  Local structure learning in high resolution remote sensing image retrieval , 2016, Neurocomputing.

[6]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[7]  DeLiang Wang,et al.  Remote Sensing Image Segmentation by Combining Spectral and Texture Features , 2014, IEEE Transactions on Geoscience and Remote Sensing.

[8]  Jürgen Schmidhuber,et al.  Learning Nonregular Languages: A Comparison of Simple Recurrent Networks and LSTM , 2002, Neural Computation.

[9]  Peter Young,et al.  From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions , 2014, TACL.

[10]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[11]  Xiangtao Zheng,et al.  Spectral–Spatial Kernel Regularized for Hyperspectral Image Denoising , 2015, IEEE Transactions on Geoscience and Remote Sensing.

[12]  Cordelia Schmid,et al.  Aggregating Local Image Descriptors into Compact Codes , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[13]  Ronald J. Williams,et al.  A Learning Algorithm for Continually Running Fully Recurrent Neural Networks , 1989, Neural Computation.

[14]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[15]  Li Fei-Fei,et al.  DenseCap: Fully Convolutional Localization Networks for Dense Captioning , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Fei-Fei Li,et al.  Deep visual-semantic alignments for generating image descriptions , 2015, CVPR.

[17]  Geoffrey E. Hinton,et al.  Learning representations by back-propagating errors , 1986, Nature.

[18]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[19]  Florent Perronnin,et al.  Large-scale image retrieval with compressed Fisher vectors , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[20]  Gui-Song Xia,et al.  AID: A Benchmark Data Set for Performance Evaluation of Aerial Scene Classification , 2016, IEEE Transactions on Geoscience and Remote Sensing.

[21]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[22]  Xuelong Li,et al.  Latent Semantic Minimal Hashing for Image Retrieval , 2017, IEEE Transactions on Image Processing.

[23]  Xueming Qian,et al.  Semantic Annotation of High-Resolution Satellite Images via Weakly Supervised Learning , 2016, IEEE Transactions on Geoscience and Remote Sensing.

[24]  Shawn D. Newsam,et al.  Bag-of-visual-words and spatial extensions for land-use classification , 2010, GIS '10.

[25]  M. Neubert,et al.  A COMPARISON OF SEGMENTATION PROGRAMS FOR HIGH RESOLUTION REMOTE SENSING DATA , 2004 .

[26]  Xiaoqiang Lu,et al.  Remote Sensing Image Scene Classification: Benchmark and State of the Art , 2017, Proceedings of the IEEE.

[27]  Lei Guo,et al.  Object Detection in Optical Remote Sensing Images Based on Weakly Supervised Learning and High-Level Feature Learning , 2015, IEEE Transactions on Geoscience and Remote Sensing.

[28]  Xinlei Chen,et al.  Microsoft COCO Captions: Data Collection and Evaluation Server , 2015, ArXiv.

[29]  Xiangtao Zheng,et al.  Joint Dictionary Learning for Multispectral Change Detection , 2017, IEEE Transactions on Cybernetics.

[30]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[31]  Junwei Han,et al.  Learning Rotation-Invariant Convolutional Neural Networks for Object Detection in VHR Optical Remote Sensing Images , 2016, IEEE Transactions on Geoscience and Remote Sensing.

[32]  Aly A. Farag,et al.  A unified framework for MAP estimation in remote sensing image segmentation , 2005, IEEE Transactions on Geoscience and Remote Sensing.

[33]  Yanfeng Gu,et al.  Multiple Kernel Sparse Representation for Airborne LiDAR Data Classification , 2017, IEEE Transactions on Geoscience and Remote Sensing.

[34]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Lorenzo Bruzzone,et al.  Classification of hyperspectral remote sensing images with support vector machines , 2004, IEEE Transactions on Geoscience and Remote Sensing.

[36]  Xiaoqiang Lu,et al.  Substance Dependence Constrained Sparse NMF for Hyperspectral Unmixing , 2015, IEEE Transactions on Geoscience and Remote Sensing.

[37]  Yejin Choi,et al.  Composing Simple Image Descriptions using Web-scale N-grams , 2011, CoNLL.

[38]  Matthijs C. Dorst Distinctive Image Features from Scale-Invariant Keypoints , 2011 .

[39]  Vicente Ordonez,et al.  Im2Text: Describing Images Using 1 Million Captioned Photographs , 2011, NIPS.

[40]  Chin-Yew Lin,et al.  ROUGE: A Package for Automatic Evaluation of Summaries , 2004, ACL 2004.

[41]  Zhenwei Shi,et al.  Can a Machine Generate Humanlike Language Descriptions for a Remote Sensing Image? , 2017, IEEE Transactions on Geoscience and Remote Sensing.

[42]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[43]  Yiannis Aloimonos,et al.  Corpus-Guided Sentence Generation of Natural Images , 2011, EMNLP.

[44]  Samy Bengio,et al.  Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Lei Guo,et al.  Effective and Efficient Midlevel Visual Elements-Oriented Land-Use Classification Using VHR Remote Sensing Images , 2015, IEEE Transactions on Geoscience and Remote Sensing.

[46]  Jun Wang,et al.  Single point iterative weighted fuzzy C-means clustering algorithm for remote sensing image segmentation , 2009, Pattern Recognit..

[47]  Sepp Hochreiter,et al.  Untersuchungen zu dynamischen neuronalen Netzen , 1991 .

[48]  Xiangtao Zheng,et al.  Discovering Diverse Subset for Unsupervised Hyperspectral Band Selection , 2017, IEEE Transactions on Image Processing.

[49]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[50]  William J. Emery,et al.  Active Learning Methods for Remote Sensing Image Classification , 2009, IEEE Transactions on Geoscience and Remote Sensing.

[51]  Bo Qu,et al.  Deep semantic understanding of high resolution remote sensing image , 2016, 2016 International Conference on Computer, Information and Telecommunication Systems (CITS).

[52]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[53]  Peter Young,et al.  Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics , 2013, J. Artif. Intell. Res..

[54]  C. Lawrence Zitnick,et al.  CIDEr: Consensus-based image description evaluation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Jon Atli Benediktsson,et al.  A Novel MKL Model of Integrating LiDAR Data and MSI for Urban Area Classification , 2015, IEEE Transactions on Geoscience and Remote Sensing.

[56]  Ye Yuan,et al.  Review Networks for Caption Generation , 2016, NIPS.

[57]  Jordi Inglada,et al.  Automatic recognition of man-made objects in high resolution optical remote sensing images by SVM classification of geometric image features , 2007 .

[58]  Xiangtao Zheng,et al.  Remote Sensing Scene Classification by Unsupervised Representation Learning , 2017, IEEE Transactions on Geoscience and Remote Sensing.

[59]  Bin Chen,et al.  Penalized Linear Discriminant Analysis of Hyperspectral Imagery for Noise Removal , 2017, IEEE Geoscience and Remote Sensing Letters.

[60]  Bo Du,et al.  Saliency-Guided Unsupervised Feature Learning for Scene Classification , 2015, IEEE Transactions on Geoscience and Remote Sensing.