Modeling Two-Stream Correspondence for Visual Sound Separation

Visual sound separation (VSS) aims to recover each sound source from a mixed audio signal under the guidance of visual information. Existing works mainly capture global-level audio-visual correspondence and exploit various visual features to enhance the appearance and motion features of the visual modality. However, they commonly neglect the intrinsic properties of the audio modality, resulting in less effective audio feature extraction and unbalanced audio-visual correspondence. To tackle this problem, we propose a novel end-to-end framework termed Modeling Two-Stream Correspondence (MTSC) for VSS, which explicitly extracts the timbre and content features of the audio modality. The proposed MTSC method employs a two-stream architecture that enhances audio-visual correspondence for both the appearance-timbre and the motion-content feature pairs. Moreover, the two-stream pipeline allows more lightweight appearance and motion features to be used for the visual modality. Extensive experiments on two benchmark musical instrument datasets demonstrate that, with the above properties, our MTSC method remarkably outperforms seven state-of-the-art VSS approaches. The implementation code and extensive experimental results of the proposed MTSC method are provided at https://github.com/CFM-MSG/MTSC-VSS.
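
To make the two-stream idea concrete, below is a minimal, hypothetical PyTorch sketch of how an appearance-timbre stream and a motion-content stream could each condition mixture-spectrogram features on a visual cue and predict a separation mask. All module names, dimensions, the gating mechanism, and the mask-averaging fusion are illustrative assumptions for exposition, not the authors' actual implementation (see the linked repository for that).

```python
# Hypothetical two-stream audio-visual correspondence sketch.
# Shapes, gating, and fusion are assumptions, not the MTSC implementation.
import torch
import torch.nn as nn


class CorrespondenceStream(nn.Module):
    """One cross-modal stream: conditions audio features on a visual cue
    and predicts a separation mask over the mixture spectrogram."""

    def __init__(self, audio_ch: int = 32, visual_dim: int = 512):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, audio_ch)  # visual cue -> channel gates
        self.mask_head = nn.Conv2d(audio_ch, 1, kernel_size=1)

    def forward(self, audio_feat: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
        # audio_feat: (B, C, F, T) encoder features of the mixture spectrogram
        # visual_feat: (B, visual_dim) pooled appearance or motion feature
        gates = torch.sigmoid(self.visual_proj(visual_feat))[..., None, None]  # (B, C, 1, 1)
        conditioned = audio_feat * gates                                       # channel-wise gating
        return torch.sigmoid(self.mask_head(conditioned))                      # (B, 1, F, T) mask


class TwoStreamSeparator(nn.Module):
    """Combines an appearance-timbre stream and a motion-content stream."""

    def __init__(self, audio_ch: int = 32, visual_dim: int = 512):
        super().__init__()
        self.appearance_timbre = CorrespondenceStream(audio_ch, visual_dim)
        self.motion_content = CorrespondenceStream(audio_ch, visual_dim)

    def forward(self, mix_spec, audio_feat, appearance_feat, motion_feat):
        # Each stream predicts a mask; averaging the two masks before applying
        # them to the mixture spectrogram is an assumed fusion choice.
        mask_at = self.appearance_timbre(audio_feat, appearance_feat)
        mask_mc = self.motion_content(audio_feat, motion_feat)
        mask = 0.5 * (mask_at + mask_mc)
        return mask * mix_spec  # estimated spectrogram of the target source


if __name__ == "__main__":
    sep = TwoStreamSeparator()
    mix = torch.randn(2, 1, 256, 64)         # (B, 1, F, T) mixture magnitude
    audio_feat = torch.randn(2, 32, 256, 64)  # mixture encoder features
    app = torch.randn(2, 512)                 # appearance feature of the target player
    mot = torch.randn(2, 512)                 # motion feature of the target player
    print(sep(mix, audio_feat, app, mot).shape)  # -> torch.Size([2, 1, 256, 64])
```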