Adaptively Clustering-Driven Learning for Visual Relationship Detection

Visual relationship detection aims to describe the interactions between pairs of objects, such as the triplets person-ride-bike and bike-next to-car. In practice, relationships often form groups: some relationships are strongly correlated within a group, while relationships across groups are only weakly related. Intuitively, common relationships can be roughly categorized into several types, such as geometric (e.g., next to) and action (e.g., ride). However, previous studies ignore this relatedness among relationships: they operate in a single unified space and map visual features or statistical dependencies directly to predicate categories. To tackle this problem, we propose an adaptively clustering-driven network for visual relationship detection, which implicitly divides the unified relationship space into several subspaces with distinct characteristics. Specifically, we propose two novel modules that discover the common distribution space and the latent relationship associations, respectively, mapping pairs of object features into translation subspaces to induce discriminative relationship clustering. A fused inference module then integrates the group-induced representations with language priors to facilitate predicate prediction. In addition, we design a Frobenius-norm regularization to strengthen the clustering. To the best of our knowledge, the proposed method is the first supervised framework to realize subject-predicate-object relationship-aware clustering for visual relationship detection. Extensive experiments show that the proposed method achieves competitive performance against state-of-the-art methods on the Visual Genome dataset, and ablation studies further validate its effectiveness.
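
To make the two abstract ideas concrete, the sketch below illustrates one plausible way to map a subject-object feature pair into several translation subspaces, fuse them with a soft cluster assignment, and regularize that assignment with a Frobenius-norm term. This is only a minimal illustration under assumed shapes and names (ClusteredTranslationHead, feat_dim, sub_dim, num_clusters, and the softmax gating are all hypothetical), not the authors' actual architecture; the paper's exact regularizer and module designs are not specified here.

```python
# Minimal sketch: K translation subspaces (TransE-style subtraction o_k - s_k),
# a soft cluster assignment over subspaces, and a Frobenius-norm regularizer
# on the assignment matrix. All names and the exact forms are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ClusteredTranslationHead(nn.Module):
    def __init__(self, feat_dim, sub_dim, num_clusters, num_predicates):
        super().__init__()
        # One projection per subspace for subject and object features (assumed design).
        self.proj_s = nn.ModuleList([nn.Linear(feat_dim, sub_dim) for _ in range(num_clusters)])
        self.proj_o = nn.ModuleList([nn.Linear(feat_dim, sub_dim) for _ in range(num_clusters)])
        # Soft cluster assignment predicted from the concatenated pair feature.
        self.assign = nn.Linear(2 * feat_dim, num_clusters)
        # Predicate classifier on the assignment-weighted translation vector.
        self.classify = nn.Linear(sub_dim, num_predicates)

    def forward(self, subj_feat, obj_feat):
        # subj_feat, obj_feat: (B, feat_dim) appearance features of the object pair.
        pair = torch.cat([subj_feat, obj_feat], dim=-1)
        A = F.softmax(self.assign(pair), dim=-1)            # (B, K) soft assignments
        # Translation vector in every subspace: projected object minus projected subject.
        trans = torch.stack(
            [po(obj_feat) - ps(subj_feat) for ps, po in zip(self.proj_s, self.proj_o)],
            dim=1,
        )                                                    # (B, K, sub_dim)
        fused = torch.einsum("bk,bkd->bd", A, trans)         # assignment-weighted fusion
        logits = self.classify(fused)                        # (B, num_predicates)
        # Illustrative Frobenius-norm regularizer on the assignment matrix.
        frob_reg = torch.norm(A, p="fro") / A.shape[0]
        return logits, frob_reg


# Example usage with random features (batch of 8 pairs, 50 predicate classes):
head = ClusteredTranslationHead(feat_dim=512, sub_dim=128, num_clusters=5, num_predicates=50)
logits, reg = head(torch.randn(8, 512), torch.randn(8, 512))
loss = F.cross_entropy(logits, torch.randint(0, 50, (8,))) + 0.01 * reg
```

In this sketch the subtraction of projected subject and object features plays the role of the translation subspaces, while the regularized soft assignment stands in for the clustering; how the language prior is fused and how the regularizer is weighted are design choices left to the full method.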