SAENet: Self-Supervised Adversarial and Equivariant Network for Weakly Supervised Object Detection in Remote Sensing Images

Weakly supervised object detection (WSOD) in remote sensing images (RSIs) remains challenging because the detection model must be learned from only image-level annotations. Most existing methods optimize the detector by exploiting the region that contributes most to the image-level classification, and are therefore dominated by the most discriminative part of an object rather than its full extent. Meanwhile, these methods ignore the consistency that should hold across different spatial transformations of the same image and may assign such transformed copies different labels, which introduces ambiguity. To tackle these challenges, we propose a self-supervised adversarial and equivariant network (SAENet) that learns complementary and consistent visual patterns for WSOD in RSIs. To this end, an adversarial dropout–activation block is first designed to adaptively hide the most discriminative parts and highlight instance-related regions, encouraging the detector to cover the entire object. We further introduce a flexible self-supervised transformation-equivariance mechanism that enforces consistency of each potential instance across multiple spatial transformations, yielding spatially consistent self-supervision. The resulting supervision is then leveraged to train a more robust and spatially consistent object detector. Comprehensive experiments on the challenging LEarning, VIsion and Remote sensing Laboratory (LEVIR), Northwestern Polytechnical University (NWPU) VHR-10.v2, and detection in optical RSIs (DIOR) datasets show that SAENet outperforms previous state-of-the-art methods, achieving 46.2%, 60.7%, and 27.1% mAP, respectively.
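The abstract gives no implementation details, so the following PyTorch-style sketch illustrates one plausible reading of the two components: adversarially hiding the most discriminative activations, and a transformation-equivariance consistency loss between corresponding proposals of an image and its spatially transformed copy. Function names, the drop ratio, and the KL-divergence form are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def adversarial_hide(feat_map, cam, drop_ratio=0.3):
    """Zero out the most discriminative spatial positions (assumed form).

    feat_map: (N, C, H, W) backbone features.
    cam:      (N, 1, H, W) activation map for the predicted class.
    Positions whose activation exceeds the per-image (1 - drop_ratio)
    quantile are masked, forcing the detector to rely on the
    complementary, less discriminative object parts.
    """
    n = cam.size(0)
    thresh = torch.quantile(cam.view(n, -1), 1.0 - drop_ratio, dim=1)
    mask = (cam < thresh.view(n, 1, 1, 1)).float()
    return feat_map * mask

def equivariance_loss(scores_orig, scores_trans):
    """Consistency between per-proposal class distributions of an image
    and its spatially transformed copy (e.g., a flip or rotation),
    assuming proposals were put in correspondence by applying the same
    transformation to the boxes.

    scores_*: (R, K) proposal classification logits.
    """
    log_p = F.log_softmax(scores_trans, dim=1)
    q = F.softmax(scores_orig.detach(), dim=1)  # original view as target
    return F.kl_div(log_p, q, reduction="batchmean")
```

In this reading, the consistency term supplies the "spatially consistent self-supervision" the abstract mentions: the original view acts as a fixed target, and the transformed view is pulled toward it, so differently transformed copies of the same instance can no longer be assigned different classes.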