Audio-Visual Localization by Synthetic Acoustic Image Generation

Acoustic images are an emerging data modality for multimodal scene understanding. Unlike recordings from mono or binaural microphones, they capture the spectral signature of sounds arriving from each direction in space, and thus carry richer spatial information. However, acoustic images are typically produced by cumbersome microphone arrays, which are far less widespread than the ordinary microphones mounted on optical cameras. To exploit this richer modality while relying only on standard microphones and cameras, we propose to generate synthetic acoustic images from common audio-video data for the task of audio-visual localization. The synthetic acoustic images are produced by a novel deep architecture, based on Variational Autoencoder and U-Net models, which is trained to reconstruct the ground-truth spatialized audio collected by a microphone array from the associated video and its corresponding monaural audio signal. In other words, the model learns to mimic what an array of microphones would produce under the same conditions. We assess the quality of the generated acoustic images qualitatively and quantitatively on the task of unsupervised sound source localization, and also report standard generation metrics. The model is evaluated both on multimodal datasets containing acoustic images, which are used for training, and on unseen datasets containing only monaural audio and RGB frames, where it achieves more accurate localization than the state of the art.
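To make the described pipeline concrete, below is a minimal PyTorch sketch of a VAE-plus-decoder generator in the spirit of the abstract: a video frame and a monaural spectrogram are encoded and fused into a VAE latent, which a small U-Net-style upsampling decoder turns into an acoustic image (a spatial grid of per-direction spectra). This is an illustrative sketch under stated assumptions, not the paper's actual design: the class name AcousticImageGenerator, the 36x48 spatial grid, the number of frequency bins, and all layer sizes are hypothetical.

```python
import torch
import torch.nn as nn

class AcousticImageGenerator(nn.Module):
    """Hypothetical sketch: fuse video and monaural-audio features into a
    VAE latent, then decode it into an acoustic image, i.e. a map of
    per-direction spectral energy (shapes are illustrative only)."""

    def __init__(self, latent_dim=128, freq_bins=128, map_size=(36, 48)):
        super().__init__()
        # Placeholder encoders for the video and mono-audio branches.
        self.video_enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.audio_enc = nn.Sequential(  # input: mono spectrogram (B, 1, F, T)
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # VAE heads: fused features -> latent mean / log-variance.
        self.fc_mu = nn.Linear(128, latent_dim)
        self.fc_logvar = nn.Linear(128, latent_dim)
        # Decoder: project the latent to a coarse grid, then upsample 4x.
        self.h, self.w = map_size[0] // 4, map_size[1] // 4
        self.fc_dec = nn.Linear(latent_dim, 64 * self.h * self.w)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, freq_bins, 4, stride=2, padding=1),
        )

    def forward(self, frame, spec):
        v = self.video_enc(frame)            # (B, 64)
        a = self.audio_enc(spec)             # (B, 64)
        fused = torch.cat([v, a], dim=1)     # (B, 128)
        mu, logvar = self.fc_mu(fused), self.fc_logvar(fused)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparam.
        g = self.fc_dec(z).view(-1, 64, self.h, self.w)
        return self.decoder(g), mu, logvar   # acoustic image: (B, F, 36, 48)
```

A possible training objective and the unsupervised localization read-out, again as a hedged sketch (the ground-truth tensor here is a random stand-in for the data a real microphone array would provide):

```python
model = AcousticImageGenerator()
frames = torch.randn(4, 3, 224, 224)        # RGB frames
specs = torch.randn(4, 1, 128, 64)          # mono log-spectrograms
gen, mu, logvar = model(frames, specs)

# Training loss (sketch): reconstruct the array's ground-truth acoustic
# image, regularized by the usual VAE KL term.
gt = torch.randn_like(gen)                  # stand-in for array data
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
loss = nn.functional.mse_loss(gen, gt) + 0.1 * kl

# Unsupervised localization (sketch): sum energy over frequency and
# take the spatial peak as the estimated source direction.
energy = gen.sum(dim=1)                     # (B, 36, 48)
flat = energy.flatten(1).argmax(dim=1)
row, col = flat // energy.shape[2], flat % energy.shape[2]
```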
