Visual Grounding in Remote Sensing Images

Retrieving ground objects from large-scale remote sensing images is important for many applications. We present the novel problem of visual grounding in remote sensing images. Visual grounding aims to localize particular objects (as a bounding box or segmentation mask) in an image given a natural language expression. The task has been studied in the computer vision community, but existing benchmark datasets and methods focus mainly on natural images rather than remote sensing images. Compared with natural images, remote sensing images contain large-scale scenes and geographical spatial information about ground objects (e.g., longitude and latitude), challenges that existing methods cannot handle. In this paper, we collect a new visual grounding dataset, called RSVG, and design a new method, namely GeoVG. The proposed method consists of a language encoder, an image encoder, and a fusion module. The language encoder learns numerical geospatial relations and represents a complex expression as a geospatial relation graph. The image encoder learns large-scale remote sensing scenes with adaptive region attention. The fusion module fuses the text and image features for visual grounding. We evaluate the proposed method against state-of-the-art methods on RSVG. Experiments show that our method outperforms the previous methods on the proposed dataset. https://sunyuxi.github.io/publication/GeoVG
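To make the three-module pipeline concrete, here is a minimal, hypothetical PyTorch sketch of the design described above. Every module name, dimension, and internal detail is an illustrative assumption rather than the authors' implementation: a plain GRU stands in for the geospatial relation graph encoder, a small CNN with a learned spatial attention map stands in for adaptive region attention, and fusion is concatenation followed by an MLP that regresses a normalized bounding box.

```python
# Hypothetical sketch of the GeoVG-style pipeline; not the authors' code.
import torch
import torch.nn as nn

class LanguageEncoder(nn.Module):
    """Stand-in for the geospatial relation graph encoder (assumption)."""
    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):                  # (B, T) token ids
        _, h = self.gru(self.embed(token_ids))
        return h.squeeze(0)                        # (B, hidden_dim)

class ImageEncoder(nn.Module):
    """Small CNN backbone with a learned per-location attention map,
    standing in for adaptive region attention over large-scale scenes."""
    def __init__(self, out_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, out_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.attn = nn.Conv2d(out_dim, 1, 1)       # attention logits

    def forward(self, img):                        # (B, 3, H, W)
        feat = self.backbone(img)                  # (B, C, H', W')
        w = torch.softmax(self.attn(feat).flatten(2), dim=-1)   # (B, 1, H'*W')
        return (feat.flatten(2) * w).sum(-1)       # attention-pooled (B, C)

class GeoVG(nn.Module):
    """Fuse text and image features and regress a box (cx, cy, w, h)."""
    def __init__(self, dim=256):
        super().__init__()
        self.lang = LanguageEncoder(hidden_dim=dim)
        self.img = ImageEncoder(out_dim=dim)
        self.fusion = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                    nn.Linear(dim, 4), nn.Sigmoid())

    def forward(self, img, token_ids):
        fused = torch.cat([self.img(img), self.lang(token_ids)], dim=-1)
        return self.fusion(fused)                  # normalized box coordinates

# Usage: one 512x512 image and a 12-token expression.
model = GeoVG()
box = model(torch.randn(1, 3, 512, 512), torch.randint(0, 10000, (1, 12)))
print(box.shape)  # torch.Size([1, 4])
```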
