VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency

We introduce a new approach for audio-visual speech separation. Given a video, the goal is to extract the speech associated with a face despite simultaneous background sounds and/or other human speakers. Whereas existing methods focus on learning the alignment between the speaker's lip movements and the sounds they generate, we propose to leverage the speaker's facial appearance as an additional prior on the vocal qualities they are likely to produce. Our approach jointly learns audio-visual speech separation and cross-modal speaker embeddings from unlabeled video. It yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement, and generalizes well to challenging real-world videos across diverse scenarios. Our video results and code: http://vision.cs.utexas.edu/projects/VisualVoice/.
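To make the joint objective concrete, the sketch below illustrates one way the two pieces could fit together: a spectrogram-masking network conditioned on both lip motion and a face embedding, trained with a separation loss plus a cross-modal consistency term that pulls each speaker's voice embedding toward their own face embedding and away from other faces in the batch. This is a minimal PyTorch illustration, not the paper's implementation: all module names, layer sizes, input shapes, and the loss weighting are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVSpeechSeparator(nn.Module):
    """Minimal sketch of joint audio-visual separation: condition a
    masking network on lip-motion features and a face-identity
    embedding, and expose a voice embedding for a cross-modal loss.
    All sizes below are illustrative, not the paper's."""

    def __init__(self, emb_dim=128, n_freq=257):
        super().__init__()
        # stand-ins for the lip-motion and face-appearance networks
        self.lip_net = nn.GRU(input_size=88 * 88, hidden_size=emb_dim,
                              batch_first=True)
        self.face_net = nn.Sequential(nn.Flatten(),
                                      nn.Linear(3 * 224 * 224, emb_dim))
        # audio branch embeds each spectrogram frame
        self.audio_net = nn.Linear(n_freq, emb_dim)
        # voice embedding pooled from the separated speech
        self.voice_head = nn.Linear(emb_dim, emb_dim)
        # mask predictor over the fused audio-visual features
        self.mask_head = nn.Linear(3 * emb_dim, n_freq)

    def forward(self, mix_spec, lips, face):
        # mix_spec: (B, T, F) magnitude spectrogram of the mixture
        # lips:     (B, T, 88*88) flattened mouth-region frames
        # face:     (B, 3, 224, 224) one face image per speaker
        lip_feat, _ = self.lip_net(lips)                     # (B, T, D)
        face_emb = F.normalize(self.face_net(face), dim=-1)  # (B, D)
        audio_feat = self.audio_net(mix_spec)                # (B, T, D)
        fused = torch.cat([audio_feat, lip_feat,
                           face_emb.unsqueeze(1).expand_as(lip_feat)],
                          dim=-1)
        mask = torch.sigmoid(self.mask_head(fused))          # (B, T, F)
        sep_spec = mask * mix_spec
        voice_emb = F.normalize(
            self.voice_head(self.audio_net(sep_spec).mean(dim=1)),
            dim=-1)
        return sep_spec, face_emb, voice_emb


def joint_loss(sep_spec, target_spec, face_emb, voice_emb, margin=0.5):
    """Separation loss plus a hinged cross-modal consistency term."""
    sep = F.l1_loss(sep_spec, target_spec)
    sim = voice_emb @ face_emb.t()                           # (B, B)
    pos = sim.diagonal()                                     # own face
    # hardest other face in the batch as the negative
    neg = sim.masked_fill(torch.eye(len(sim), dtype=torch.bool,
                                    device=sim.device),
                          float('-inf')).max(dim=1).values
    consistency = F.relu(margin - pos + neg).mean()
    return sep + 0.1 * consistency  # weighting is an assumption
```

The cross-modal term is what lets facial appearance act as a prior: at test time, the face embedding steers the predicted mask toward speech whose vocal qualities are consistent with that speaker.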
