Local Semantic Enhanced ConvNet for Aerial Scene Recognition

Aerial scene recognition is challenging due to the complex object distributions and spatial arrangements in large-scale aerial images. Recent studies explore the local semantic representation capability of deep learning models, but how to precisely perceive the key local regions remains an open problem. In this paper, we present a local semantic enhanced ConvNet (LSE-Net) for aerial scene recognition that mimics human visual perception of key local regions in aerial scenes, with the aim of building a discriminative local semantic representation. Our LSE-Net consists of a context-enhanced convolutional feature extractor, a local semantic perception module, and a classification layer. First, we design multi-scale dilated convolution operators that fuse multi-level and multi-scale convolutional features in a trainable manner, so as to fully capture the local feature responses in an aerial scene. These features are then fed into our two-branch local semantic perception module. In this module, we design a context-aware class peak response (CACPR) measurement to precisely depict the visual impulse of key local regions together with the corresponding context information; in parallel, a spatial attention weight matrix is extracted to describe the importance of each key local region for the aerial scene. Finally, the refined class confidence maps are fed into the classification layer. Extensive experiments on three aerial scene classification benchmarks show that our LSE-Net achieves state-of-the-art performance, validating the effectiveness of the local semantic perception module and the CACPR measurement.
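The fusion of features computed at several dilation rates can be illustrated with a minimal sketch. This is not the paper's implementation: it works in 1-D for readability, and the kernel values, dilation rates, and fusion weights are illustrative assumptions (the paper learns the fusion in a trainable manner).

```python
# Minimal 1-D illustration of multi-scale dilated convolution fusion.
# All concrete numbers (kernel, dilation rates, fusion weights) are
# assumptions for the sketch, not values from the paper.

def dilated_conv1d(signal, kernel, dilation):
    """'Same'-padded 1-D convolution with the given dilation rate."""
    k = len(kernel)
    span = dilation * (k - 1)            # receptive field minus one
    pad = span // 2
    padded = [0.0] * pad + list(signal) + [0.0] * (span - pad)
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j in range(k):
            acc += kernel[j] * padded[i + j * dilation]
        out.append(acc)
    return out

def multi_scale_fuse(signal, kernel, dilations, weights):
    """Apply one kernel at several dilation rates and fuse the branch
    outputs by a weighted sum (fixed weights stand in for the paper's
    trainable fusion)."""
    branches = [dilated_conv1d(signal, kernel, d) for d in dilations]
    return [sum(w * b[i] for w, b in zip(weights, branches))
            for i in range(len(signal))]

x = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]   # a single local response
kernel = [0.25, 0.5, 0.25]                 # simple smoothing kernel
fused = multi_scale_fuse(x, kernel, dilations=[1, 2, 3],
                         weights=[0.5, 0.3, 0.2])
```

Larger dilation rates spread the same three-tap kernel over a wider window, so the fused output keeps a sharp peak at the response location while also collecting context from more distant positions.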
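The abstract does not specify how CACPR is computed, but the general class-peak-response idea of locating local maxima in a class confidence map and scoring them with surrounding context can be sketched as follows. The threshold, context window size, and the additive peak-plus-context score are all assumptions for illustration, not the paper's formulation.

```python
# Hedged sketch of a class-peak-response style search with a context
# term. Threshold, window size, and scoring rule are illustrative
# assumptions; the paper's CACPR measurement may differ.

def context_aware_peaks(conf_map, thresh=0.5, ctx=1):
    """Return (row, col, score) for each local maximum of a 2-D class
    confidence map, where score = peak value + mean response over the
    (2*ctx+1) x (2*ctx+1) context window around the peak."""
    h, w = len(conf_map), len(conf_map[0])
    peaks = []
    for r in range(h):
        for c in range(w):
            v = conf_map[r][c]
            if v < thresh:
                continue
            # 8-neighbourhood test: keep only (non-strictly-dominated) maxima
            neigh = [conf_map[rr][cc]
                     for rr in range(max(0, r - 1), min(h, r + 2))
                     for cc in range(max(0, c - 1), min(w, c + 2))
                     if (rr, cc) != (r, c)]
            if any(n > v for n in neigh):
                continue
            # context term: mean response over the surrounding window
            win = [conf_map[rr][cc]
                   for rr in range(max(0, r - ctx), min(h, r + ctx + 1))
                   for cc in range(max(0, c - ctx), min(w, c + ctx + 1))]
            peaks.append((r, c, v + sum(win) / len(win)))
    return peaks

cmap = [[0.1, 0.2, 0.1],
        [0.2, 0.9, 0.2],
        [0.1, 0.2, 0.1]]
found = context_aware_peaks(cmap)
```

On this toy map the single peak at (1, 1) is kept, and its score combines the peak response with the average of its 3x3 neighbourhood, which is the flavour of "context-aware" scoring the abstract describes.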