Paper Information - Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching

Discriminatively localizing sounding objects in cocktail-party scenes, i.e., scenes with mixed sounds, is commonplace for humans but still challenging for machines. In this paper, we propose a two-stage learning framework for self-supervised, class-aware sounding object localization. First, we learn robust object representations by aggregating the candidate sound localization results in single-source scenes. Then, class-aware object localization maps are generated in cocktail-party scenarios by referring to the pre-learned object knowledge, and the sounding objects are selected by matching the audio and visual object category distributions, with audiovisual consistency serving as the self-supervised signal. Experimental results on both realistic and synthesized cocktail-party videos demonstrate that our model is superior in filtering out silent objects and pointing out the locations of sounding objects of different classes. Code is available at https://github.com/DTaoo/Discriminative-Sounding-Objects-Localization.
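The second-stage matching of audio and visual category distributions can be illustrated with a small sketch. It is not taken from the released code. It assumes that each class-aware localization map is reduced to one score by global average pooling, that the scores are normalized with a softmax, and that KL divergence measures audiovisual consistency. The function names, the pooling choice, and the selection threshold are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def visual_category_distribution(class_maps):
    """Turn K class-aware localization maps into a distribution over K classes.

    Assumption: each map (a list of rows) is pooled by global averaging,
    then the K pooled scores are normalized with a softmax.
    """
    pooled = [
        sum(sum(row) for row in m) / (len(m) * len(m[0]))
        for m in class_maps
    ]
    return softmax(pooled)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q), used here as an assumed audiovisual consistency loss."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def select_sounding_maps(class_maps, audio_dist, threshold=0.3):
    """Keep maps of classes the audio deems present; zero out the rest.

    The threshold is a hypothetical value chosen for illustration.
    """
    return [
        m if a >= threshold else [[0.0] * len(row) for row in m]
        for m, a in zip(class_maps, audio_dist)
    ]


# Toy example: 3 object classes on a 2x2 spatial grid.
class_maps = [
    [[0.9, 0.8], [0.1, 0.0]],   # class 0: strong response
    [[0.7, 0.6], [0.5, 0.4]],   # class 1: visible but silent
    [[0.0, 0.1], [0.8, 0.9]],   # class 2: strong response
]
audio_dist = [0.6, 0.05, 0.35]  # assumed audio category distribution

visual_dist = visual_category_distribution(class_maps)
loss = kl_divergence(audio_dist, visual_dist)
sounding_maps = select_sounding_maps(class_maps, audio_dist)
```

In this toy setup, class 1 is visible in the frame but has low audio probability, so its map is suppressed. That mirrors the goal of filtering out silent objects described above, and the KL term stands in for the self-supervised consistency signal.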
