Mix and Localize: Localizing Sound Sources in Mixtures

We present a method for simultaneously localizing multiple sound sources within a visual scene. This task requires a model both to group a sound mixture into individual sources and to associate each source with the visual signal. Our method solves both tasks at once, using a formulation inspired by the contrastive random walk of Jabri et al. [13]. We create a graph in which images and separated sounds correspond to nodes, and train a random walker to transition between nodes from different modalities such that it returns to its starting node with high probability. The transition probabilities for this walk are determined by an audio-visual similarity metric that our model learns. Through experiments with musical instruments and human speech, we show that our model successfully localizes multiple sounds, outperforming other self-supervised methods.
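To make the walk concrete, here is a minimal sketch of the cycle-consistency objective such a cross-modal random walk could use. This is not the paper's released code: the encoders producing the features, the temperature value, and the function name are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def cycle_walk_loss(img_feats, audio_feats, temperature=0.07):
    """Contrastive random walk between modalities (illustrative sketch).

    img_feats:   (N, D) embeddings of image nodes (e.g., candidate regions)
    audio_feats: (N, D) embeddings of separated sound sources
    Returns the negative log probability that a walk of the form
    images -> sounds -> images returns to its starting node.
    """
    img_feats = F.normalize(img_feats, dim=-1)
    audio_feats = F.normalize(audio_feats, dim=-1)

    # A learned audio-visual similarity defines the transition probabilities.
    sim = img_feats @ audio_feats.t() / temperature   # (N, N) affinities
    p_iv = F.softmax(sim, dim=1)        # image -> sound transitions
    p_vi = F.softmax(sim.t(), dim=1)    # sound -> image transitions

    # Round-trip transition matrix: row i is the distribution over image
    # nodes after walking from image i through the sound nodes and back.
    p_cycle = p_iv @ p_vi

    # Maximize the return (diagonal) probability for every starting node.
    targets = torch.arange(p_cycle.size(0), device=p_cycle.device)
    return F.nll_loss(torch.log(p_cycle + 1e-8), targets)
```

Training with this loss pushes the diagonal of the round-trip matrix toward 1, which can only happen if the learned similarity matches each separated sound to its corresponding visual source.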

[1] Alexei A. Efros, et al. Learning Pixel Trajectories with Multiscale Contrastive Random Walks, 2022, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[2] Andrew Owens, et al. Structure from Silence: Learning Scene Structure from Ambient Sound, 2021, CoRL.

[3] K. K. Rachavarapu, et al. Localize to Binauralize: Audio Spatialization from Visual Sound Source Localization, 2021, IEEE/CVF International Conference on Computer Vision (ICCV).

[4] Anoop Cherian, et al. Visual Scene Graphs for Audio Source Separation, 2021, IEEE/CVF International Conference on Computer Vision (ICCV).

[5] Kristen Grauman, et al. Move2Hear: Active Audio-Visual Source Separation, 2021, IEEE/CVF International Conference on Computer Vision (ICCV).

[6] Andrea Vedaldi, et al. Localizing Visual Sounds the Hard Way, 2021, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7] Yapeng Tian, et al. Cyclic Co-Learning of Sounding Object Visual Grounding and Sound Separation, 2021, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8] Abhinav Valada, et al. There is More than Meets the Eye: Self-Supervised Multi-Object Detection and Tracking with Sound by Distilling Multimodal Knowledge, 2021, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[9] Kristen Grauman, et al. VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency, 2021, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[10] Weiyao Lin, et al. Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching, 2020, NeurIPS.

[11] Andrew Owens, et al. Self-Supervised Learning of Audio-Visual Objects from Video, 2020, ECCV.

[12] Weiyao Lin, et al. Multiple Sound Sources Localization from Coarse to Fine, 2020, ECCV.

[13] Alexei A. Efros, et al. Space-Time Correspondence as a Contrastive Random Walk, 2020, NeurIPS.

[14] Andrea Vedaldi, et al. Labelling unlabelled videos from scratch with multi-modal self-supervision, 2020, NeurIPS.

[15] Ron J. Weiss, et al. Unsupervised Sound Separation Using Mixture Invariant Training, 2020, NeurIPS.

[16] Justin Salamon, et al. Telling Left From Right: Learning Spatial Correspondence of Sight and Sound, 2020, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17] Kristen Grauman, et al. VisualEchoes: Spatial Image Representation Learning through Echolocation, 2020, ECCV.

[18] Andrew Zisserman, et al. VGGSound: A Large-Scale Audio-Visual Dataset, 2020, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[19] Yong Jae Lee, et al. Audiovisual SlowFast Networks for Video Recognition, 2020, arXiv.

[20] Seong Joon Oh, et al. Evaluating Weakly Supervised Object Localization Methods Right, 2020, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21] Chuang Gan, et al. Self-supervised Audio-visual Co-segmentation, 2019, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[22] Chuang Gan, et al. The Sound of Motions, 2019, IEEE/CVF International Conference on Computer Vision (ICCV).

[23] Allan Jabri, et al. Learning Correspondence From the Cycle-Consistency of Time, 2019, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[24] Xuelong Li, et al. Deep Multimodal Clustering for Unsupervised Audiovisual Learning, 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[25] James R. Glass, et al. Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input, 2018, International Journal of Computer Vision.

[26] Oriol Vinyals, et al. Representation Learning with Contrastive Predictive Coding, 2018, arXiv.

[27] Sergio Guadarrama, et al. Tracking Emerges by Colorizing Videos, 2018, ECCV.

[28] Joon Son Chung, et al. VoxCeleb2: Deep Speaker Recognition, 2018, INTERSPEECH.

[29] Lorenzo Torresani, et al. Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization, 2018, NeurIPS.

[30] Stella X. Yu, et al. Unsupervised Feature Learning via Non-parametric Instance Discrimination, 2018, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[31] Joon Son Chung, et al. The Conversation: Deep Audio-Visual Speech Enhancement, 2018, INTERSPEECH.

[32] Andrew Owens, et al. Audio-Visual Scene Analysis with Self-Supervised Multisensory Features, 2018, ECCV.

[33] Chuang Gan, et al. The Sound of Pixels, 2018, ECCV.

[34] Andrew Zisserman, et al. Seeing Voices and Hearing Faces: Cross-Modal Biometric Matching, 2018, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[35] Chenliang Xu, et al. Audio-Visual Event Localization in Unconstrained Videos, 2018, ECCV.

[36] Tae-Hyun Oh, et al. Learning to Localize Sound Source in Visual Scenes, 2018, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[37] Andrew Zisserman, et al. Objects that Sound, 2017, ECCV.

[38] Tillman Weyde, et al. Singing Voice Separation with Deep U-Net Convolutional Networks, 2017, ISMIR.

[39] Andrew Zisserman, et al. Look, Listen and Learn, 2017, IEEE International Conference on Computer Vision (ICCV).

[40] Jesper Jensen, et al. Permutation invariant training of deep models for speaker-independent multi-talker speech separation, 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[41] Joon Son Chung, et al. Out of Time: Automated Lip Sync in the Wild, 2016, ACCV Workshops.

[42] Antonio Torralba, et al. SoundNet: Learning Sound Representations from Unlabeled Video, 2016, NIPS.

[43] Andrew Owens, et al. Ambient Sound Provides Supervision for Visual Learning, 2016, ECCV.

[44] Andrew Owens, et al. Visually Indicated Sounds, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45] Bolei Zhou, et al. Learning Deep Features for Discriminative Localization, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[47] Zhuo Chen, et al. Deep clustering: Discriminative embeddings for segmentation and separation, 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[48] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[49] Andrzej Cichocki, et al. Nonnegative Matrix and Tensor Factorization, 2007.

[50] Sam T. Roweis, et al. One Microphone Source Separation, 2000, NIPS.

[51] H. S. Colburn, et al. Speech intelligibility and localization in a multi-source environment, 1999, The Journal of the Acoustical Society of America.

[52] Jitendra Malik, et al. Normalized cuts and image segmentation, 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.