The Conversation: Deep Audio-Visual Speech Enhancement

Our goal is to isolate individual speakers from multi-talker simultaneous speech in videos. Existing works in this area have focused on separating utterances from known speakers in controlled environments. In this paper, we propose a deep audio-visual speech enhancement network that is able to separate a speaker's voice given lip regions in the corresponding video, by predicting both the magnitude and the phase of the target signal. The method is applicable to speakers unheard and unseen during training, and to unconstrained environments. We demonstrate strong quantitative and qualitative results, isolating extremely challenging real-world examples.
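The reconstruction step implied by the abstract (enhance the mixture's magnitude spectrogram with a predicted mask, combine it with a predicted phase, and invert the STFT) can be sketched as follows. This is an illustrative outline only, not the paper's implementation: the `mag_mask` and `phase_pred` arguments stand in for the outputs of the audio-visual network, and the STFT parameters are arbitrary choices for demonstration.

```python
import numpy as np
from scipy.signal import stft, istft

def enhance(mixture, mag_mask, phase_pred, fs=16000, nperseg=400, noverlap=240):
    """Reconstruct an enhanced waveform from a noisy mixture.

    mag_mask and phase_pred are placeholders for network predictions
    (a magnitude mask and a phase spectrogram, both with the same
    time-frequency shape as the mixture's STFT).
    """
    # STFT of the noisy mixture
    _, _, Z = stft(mixture, fs=fs, nperseg=nperseg, noverlap=noverlap)
    # Mask the mixture magnitude to obtain the enhanced magnitude
    mag = np.abs(Z) * mag_mask
    # Combine enhanced magnitude with the predicted phase
    Z_hat = mag * np.exp(1j * phase_pred)
    # Invert back to the time domain
    _, x_hat = istft(Z_hat, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return x_hat
```

As a sanity check, feeding in an all-ones mask together with the mixture's own phase reproduces the input waveform, since the STFT/ISTFT pair is invertible under these window settings.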
