Geospatial relation captioning for high-spatial-resolution images by using an attention-based neural network

Abstract

High-spatial-resolution (HSR) remote sensing images are carriers of geographic information, and identifying geo-objects and the geospatial relations among them is fundamental to understanding such images. To this end, this study proposes an intelligent semantic understanding method for HSR remote sensing images based on geospatial relation captions. First, we propose a geospatial relation expression method that conveys the topological, directional and distance relations between geo-objects in HSR images. Second, we construct a training dataset of HSR images paired with their geospatial relation captions. Finally, we implement geospatial relation captioning for HSR images with an attention-based deep neural network. Experimental results show that the proposed captioning method can effectively provide geospatial semantics for HSR image understanding.
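The abstract does not specify how the three relation types are computed. As an illustration only, the Python sketch below derives a template caption from two axis-aligned bounding boxes. Every detail is an assumption and not the authors' method:

- the simplified topological categories loosely follow Egenhofer-style intersection relations, with covers/coveredBy folded into contains/inside;
- direction uses an eight-sector compass;
- the ground sample distance `gsd` is 0.5 m/pixel;
- the `near_m`/`far_m` distance thresholds are arbitrary;
- the helper names `Box`, `topology`, `direction`, `distance_m` and `describe` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple

# Eight compass sectors, counterclockwise from east, each 45 degrees wide.
_COMPASS = ["east", "northeast", "north", "northwest",
            "west", "southwest", "south", "southeast"]


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned bounding box of a geo-object, in pixels.

    Assumption: image convention, y grows downward, so north is smaller y.
    """
    name: str
    x0: float
    y0: float
    x1: float
    y1: float

    def center(self) -> Tuple[float, float]:
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)


def topology(a: Box, b: Box) -> str:
    """Simplified topological relation of a with respect to b.

    Covers/coveredBy are folded into contains/inside (a simplification).
    """
    if (a.x0, a.y0, a.x1, a.y1) == (b.x0, b.y0, b.x1, b.y1):
        return "equal"
    # Extent of the intersection along each axis (negative means a gap).
    ix = min(a.x1, b.x1) - max(a.x0, b.x0)
    iy = min(a.y1, b.y1) - max(a.y0, b.y0)
    if ix < 0 or iy < 0:
        return "disjoint"
    if ix == 0 or iy == 0:
        return "meet"  # boundaries touch, interiors do not
    if a.x0 <= b.x0 and a.y0 <= b.y0 and a.x1 >= b.x1 and a.y1 >= b.y1:
        return "contains"
    if b.x0 <= a.x0 and b.y0 <= a.y0 and b.x1 >= a.x1 and b.y1 >= a.y1:
        return "inside"
    return "overlap"


def direction(a: Box, b: Box) -> Optional[str]:
    """Compass direction of a's center as seen from b's center."""
    (ax, ay), (bx, by) = a.center(), b.center()
    dx, dy = ax - bx, by - ay  # flip y so that north is positive
    if dx == 0 and dy == 0:
        return None
    angle = math.degrees(math.atan2(dy, dx))
    return _COMPASS[math.floor((angle + 22.5) / 45) % 8]


def distance_m(a: Box, b: Box, gsd: float) -> float:
    """Shortest gap between the two boxes, converted to meters via the GSD."""
    gx = max(b.x0 - a.x1, a.x0 - b.x1, 0.0)
    gy = max(b.y0 - a.y1, a.y0 - b.y1, 0.0)
    return math.hypot(gx, gy) * gsd


def describe(a: Box, b: Box, gsd: float = 0.5,
             near_m: float = 50.0, far_m: float = 200.0) -> str:
    """Template caption combining topology, direction and distance."""
    rel = topology(a, b)
    if rel == "equal":
        return f"the {a.name} coincides with the {b.name}"
    if rel in ("contains", "inside"):
        verb = "contains" if rel == "contains" else "is inside"
        return f"the {a.name} {verb} the {b.name}"
    d = direction(a, b)
    side = f" on its {d} side" if d else ""
    if rel == "overlap":
        return f"the {a.name} overlaps the {b.name}{side}"
    if rel == "meet":
        return f"the {a.name} adjoins the {b.name}{side}"
    # Disjoint: bin the metric gap into a qualitative distance term.
    gap = distance_m(a, b, gsd)
    if gap < near_m:
        dist = "near"
    elif gap < far_m:
        dist = "at a moderate distance from"
    else:
        dist = "far from"
    return f"the {a.name} is {dist} the {b.name}, to its {d}"


if __name__ == "__main__":
    house = Box("house", 180, 40, 220, 80)
    road = Box("road", 0, 100, 400, 120)
    print(describe(house, road))  # the house is near the road, to its north
    pond = Box("pond", 60, 60, 90, 90)
    park = Box("park", 0, 0, 200, 200)
    print(describe(pond, park))   # the pond is inside the park
```

Template captions like these could serve as target sentences when training a captioning model. Real geo-objects are rarely rectangular, however, so faithful topology would need polygon geometry, for example through the spatial predicates of a geometry library such as Shapely.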
