Cross Fusion Net: A Fast Semantic Segmentation Network for Small-Scale Semantic Information Capturing in Aerial Scenes

Capturing accurate multiscale semantic information from images is of great importance for high-quality semantic segmentation. Over the past years, a large number of methods have attempted to improve the multiscale information capturing ability of networks via various means. However, these methods often suffer from unsatisfactory performance (e.g., in speed or accuracy) on images that contain many small-scale objects, such as aerial images. In this article, we propose a new network named cross fusion net (CF-Net) for fast and effective extraction of multiscale semantic information, especially small-scale semantic information. In particular, the proposed CF-Net captures more accurate small-scale semantic information in two ways. On the one hand, we develop a channel attention refinement block to select the informative features. On the other hand, we propose a cross fusion block to enlarge the receptive field of the low-level feature maps. As a result, the network encodes more accurate semantic information from small-scale objects, and the segmentation accuracy on those objects improves accordingly. We have compared the proposed CF-Net with several state-of-the-art semantic segmentation methods on two popular aerial image segmentation data sets. Experimental results reveal that the average $F_{1}$ score gain brought by our CF-Net is about 0.43%, and the $F_{1}$ score gain on small-scale objects (e.g., cars) is about 2.61%. In addition, our CF-Net achieves the fastest inference speed, which demonstrates its superiority in aerial scenes. Our code will be released at: https://github.com/pcl111/CF-Net.
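To make the two mechanisms above concrete, the following is a minimal NumPy sketch, not the paper's actual implementation: `channel_attention_refine` illustrates a generic SE-style channel attention (global pooling, a bottleneck MLP, and sigmoid gating to reweight channels), and `cross_fuse` illustrates one plausible way to fuse an upsampled high-level map into a low-level map so that low-level features gain a larger effective receptive field. All function names, weight shapes, and the nearest-neighbor upsampling choice are illustrative assumptions.

```python
import numpy as np

def channel_attention_refine(x, w1, b1, w2, b2):
    """Illustrative SE-style channel attention (the paper's exact block
    design is not specified here). x: feature map of shape (C, H, W)."""
    # Global average pooling squeezes spatial detail into a channel descriptor.
    z = x.mean(axis=(1, 2))                       # (C,)
    # Bottleneck MLP: ReLU hidden layer, then sigmoid gate per channel.
    h = np.maximum(0.0, w1 @ z + b1)              # (C // r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))      # (C,)
    # Rescale channels: informative channels are emphasized, others suppressed.
    return x * s[:, None, None]

def cross_fuse(low, high):
    """Hypothetical cross fusion: upsample the coarser high-level map
    (nearest neighbor, integer factor assumed) and add it to the
    low-level map, enlarging the latter's effective receptive field."""
    _, h, w = low.shape
    _, hh, hw = high.shape
    up = np.repeat(np.repeat(high, h // hh, axis=1), w // hw, axis=2)
    return low + up

# Usage with random weights, reduction ratio r = 2.
rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
x = rng.standard_normal((C, H, W))
w1, b1 = rng.standard_normal((C // r, C)), np.zeros(C // r)
w2, b2 = rng.standard_normal((C, C // r)), np.zeros(C)
refined = channel_attention_refine(x, w1, b1, w2, b2)   # (C, H, W)
fused = cross_fuse(refined, rng.standard_normal((C, 2, 2)))  # (C, H, W)
```

In a real network both blocks would operate on batched tensors with learned weights and a learned upsampling or convolutional fusion; the sketch only shows the data flow that the abstract describes.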