Learning to Localize Sound Source in Visual Scenes

Visual events are usually accompanied by sounds in our daily lives. We pose the question: can a machine learn the correspondence between a visual scene and its sound, and localize the sound source only by observing paired sound and visual scenes, as humans do? In this paper, we propose a novel unsupervised algorithm for localizing sound sources in visual scenes. A two-stream network, in which each stream handles one modality, is combined with an attention mechanism for sound source localization. Moreover, although the network is formulated within the unsupervised learning framework, a simple modification extends it to a unified architecture that also covers the supervised and semi-supervised settings. We additionally construct a new sound source dataset for performance evaluation. Our empirical evaluation shows that the unsupervised method sometimes reaches false conclusions, and that even a small amount of supervision, i.e., the semi-supervised setup, corrects these failures effectively.
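To make the architecture concrete, below is a minimal PyTorch sketch of such a two-stream network with inner-product attention, written under our own assumptions rather than the paper's exact configuration: the small convolutional visual stream (standing in for a pretrained VGG-style backbone), the 128-dimensional audio input, the embedding size, the triplet margin, and the names `TwoStreamLocalizer` and `correspondence_loss` are all illustrative. The attention map over spatial locations is read as the sound-source localization response, and a triplet-style correspondence loss sketches the unsupervised training signal.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoStreamLocalizer(nn.Module):
    """Two-stream audio-visual network with attention-based localization.

    The visual stream yields a spatial feature map; the sound stream yields
    a single embedding. Attention is the normalized inner product between
    the sound embedding and each spatial location, interpreted here as the
    sound-source localization map.
    """

    def __init__(self, dim=512):
        super().__init__()
        # Visual stream: stand-in conv layers (a pretrained VGG-style
        # backbone would normally be used instead).
        self.visual = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Sound stream: embeds a 1-D audio feature vector (e.g. pooled
        # spectrogram statistics; the 128-dim input is an assumption).
        self.sound = nn.Sequential(
            nn.Linear(128, dim), nn.ReLU(), nn.Linear(dim, dim),
        )

    def forward(self, frames, audio):
        v = F.normalize(self.visual(frames), dim=1)  # (B, C, H, W)
        s = F.normalize(self.sound(audio), dim=1)    # (B, C)
        # Cosine similarity between the sound embedding and every
        # spatial location -> attention / localization map.
        att = torch.einsum('bchw,bc->bhw', v, s)
        att = torch.softmax(att.flatten(1), dim=1).view_as(att)
        # Attention-weighted visual feature summarizes the attended region.
        z = torch.einsum('bchw,bhw->bc', v, att)
        return att, z, s


def correspondence_loss(z, s_pos, s_neg, margin=0.2):
    """Triplet-style unsupervised signal: the attended visual feature
    should be closer to the matching sound embedding (s_pos) than to a
    mismatched one (s_neg)."""
    d_pos = 1 - F.cosine_similarity(z, s_pos)
    d_neg = 1 - F.cosine_similarity(z, s_neg)
    return F.relu(d_pos - d_neg + margin).mean()
```

In this reading, the unsupervised setting trains with `correspondence_loss` alone, the attention map `att` can be upsampled to frame resolution for visualization, and the supervised and semi-supervised settings would additionally penalize `att` against ground-truth localization maps wherever annotations are available.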
