2.5D Visual Sound

Binaural audio provides the listener with a 3D sound sensation, allowing a rich perceptual experience of the scene. However, binaural recordings are scarce and require nontrivial expertise and equipment to obtain. We propose to convert common monaural audio into binaural audio by leveraging video. The key idea is that visual frames reveal significant spatial cues that, while absent from the accompanying single-channel audio, are strongly linked to it. Our multi-modal approach recovers this link from unlabeled video. We devise a deep convolutional neural network that learns to decode the monaural (single-channel) soundtrack into its binaural counterpart by injecting visual information about object and scene configurations. We call the resulting output 2.5D visual sound: the visual stream helps "lift" the flat single-channel audio into spatialized sound. Beyond sound generation, we show that the self-supervised representation learned by our network also benefits audio-visual source separation. Our video results: http://vision.cs.utexas.edu/projects/2.5D_visual_sound/
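To make the "decode mono into binaural" idea concrete, one common formulation (a minimal sketch, not necessarily the paper's exact pipeline) treats the mono track as the sum of the two channels, has the network predict the left-right difference signal from audio plus visual features, and then recovers each channel algebraically. The `split_binaural` helper below is a hypothetical name illustrating only that final reconstruction step, on raw samples rather than spectrograms:

```python
def split_binaural(mono, diff):
    """Recover left/right channels from a mono mix and a predicted
    difference signal, assuming mono[i] = left[i] + right[i] and
    diff[i] = left[i] - right[i] (diff is what the network would
    be trained to predict from the mono audio and video frames)."""
    left = [(m + d) / 2.0 for m, d in zip(mono, diff)]
    right = [(m - d) / 2.0 for m, d in zip(mono, diff)]
    return left, right

# Toy example: a "ground-truth" binaural pair for a source
# panning from left to right.
left_gt = [1.0, 0.5, 0.0]
right_gt = [0.0, 0.5, 1.0]
mono = [l + r for l, r in zip(left_gt, right_gt)]   # flat single-channel mix
diff = [l - r for l, r in zip(left_gt, right_gt)]   # ideal network output

left, right = split_binaural(mono, diff)  # recovers left_gt and right_gt exactly
```

Predicting the difference rather than each channel directly is attractive because the mono input already carries the shared content; the network only has to supply the spatial residual that the visual stream disambiguates.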
