Binaural audio generation via multi-task learning

We present a learning-based approach for generating binaural audio from mono audio using multi-task learning. Our formulation leverages additional information by jointly training on two related tasks: binaural audio generation and flipped-audio classification. Our model extracts spatialization features from the visual and audio input, predicts the left and right audio channels, and classifies whether the two channels have been flipped. First, we extract visual features from the video frames using a ResNet. Next, we perform binaural audio generation and flipped-audio classification using separate subnetworks conditioned on these visual features. Our method optimizes an overall objective given by the weighted sum of the two task losses. We train and evaluate our model on the FAIR-Play and YouTube-ASMR datasets, and we perform quantitative and qualitative evaluations to demonstrate the benefits of our approach over prior techniques.
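As a minimal sketch of the multi-task objective described above, the snippet below combines a per-channel reconstruction loss with a binary flipped/not-flipped classification loss via a weighted sum. The weight `lam`, the L2 reconstruction term, and the logistic classification head are illustrative assumptions, not the exact losses or hyperparameters used in the paper.

```python
import math

def multitask_loss(left_pred, left_true, right_pred, right_true,
                   flip_logit, flip_label, lam=0.5):
    """Weighted sum of the binaural generation loss and the
    flipped-channel classification loss.

    `lam` is a hypothetical task-weighting hyperparameter."""
    # L2 reconstruction loss over the two predicted channels
    gen = sum((p - t) ** 2 for p, t in zip(left_pred, left_true))
    gen += sum((p - t) ** 2 for p, t in zip(right_pred, right_true))
    # binary cross-entropy on the flipped/not-flipped logit
    p = 1.0 / (1.0 + math.exp(-flip_logit))
    cls = -(flip_label * math.log(p) + (1 - flip_label) * math.log(1 - p))
    return gen + lam * cls
```

In practice each term would be computed over spectrogram frames by the two subnetworks; the weighted-sum structure is what lets gradients from the auxiliary classification task shape the shared visual features.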
