Audio–Visual Deep Clustering for Speech Separation

Speech separation aims to separate individual voices from an audio mixture of multiple simultaneous talkers. Audio-only approaches perform poorly when the speakers are of the same gender or share similar voice characteristics, owing to the difficulty of learning feature representations that both separate voices within single frames and stream voices consistently across time. Visual signals of speech (e.g., lip movements), if available, can be leveraged to learn better feature representations for separation. In this paper, we propose a novel audio–visual deep clustering model (AVDC) that integrates visual information into the process of learning better feature representations (embeddings) for Time–Frequency (T–F) bin clustering. AVDC employs a two-stage audio–visual fusion strategy: in the first stage, speaker-wise audio–visual T–F embeddings are computed to model the audio–visual correspondence for each speaker; in the second stage, the audio–visual embeddings of all speakers are concatenated with audio embeddings computed by deep clustering from the audio mixture to form the final T–F embedding for clustering. Through a series of experiments, the proposed AVDC model is shown to outperform audio-only deep clustering and utterance-level permutation invariant training baselines, as well as three other state-of-the-art audio–visual approaches. Further analyses show that the AVDC model learns a T–F embedding that better alleviates the source permutation problem across frames. Additional experiments show that the AVDC model generalizes across mismatched numbers of speakers between training and testing and retains some robustness when visual information is partially missing.
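To make the two-stage fusion concrete, the following is a minimal PyTorch sketch of the embedding pipeline described above. It is an illustration under assumptions, not the authors' exact architecture: the class name `AVDCSketch`, all layer sizes, the BLSTM backbones, and the simple concatenation-based fusion are hypothetical choices; only the overall structure (speaker-wise audio–visual embeddings from a first-stage fusion, then concatenation with an audio-only deep-clustering embedding per T–F bin) follows the description in the abstract.

```python
# Hypothetical sketch of the AVDC two-stage fusion; sizes and layers are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVDCSketch(nn.Module):
    def __init__(self, n_freq=129, vis_dim=256, emb_dim=20, hidden=300):
        super().__init__()
        # Audio-only deep-clustering branch: BLSTM over mixture spectrogram
        # frames, projecting each T-F bin to an emb_dim-dimensional embedding.
        self.audio_blstm = nn.LSTM(n_freq, hidden, num_layers=2,
                                   batch_first=True, bidirectional=True)
        self.audio_proj = nn.Linear(2 * hidden, n_freq * emb_dim)
        # First-stage fusion: per speaker, concatenate audio frames with that
        # speaker's visual features and map to a speaker-wise T-F embedding.
        self.av_blstm = nn.LSTM(n_freq + vis_dim, hidden, num_layers=1,
                                batch_first=True, bidirectional=True)
        self.av_proj = nn.Linear(2 * hidden, n_freq * emb_dim)
        self.emb_dim = emb_dim

    def forward(self, mix_spec, visual_feats):
        # mix_spec: (B, T, F) log-magnitude spectrogram of the mixture.
        # visual_feats: list over speakers of (B, T, vis_dim) lip features,
        # assumed already upsampled to the audio frame rate.
        B, T, Fdim = mix_spec.shape
        h, _ = self.audio_blstm(mix_spec)
        audio_emb = self.audio_proj(h).view(B, T, Fdim, self.emb_dim)
        speaker_embs = []
        for v in visual_feats:
            h_av, _ = self.av_blstm(torch.cat([mix_spec, v], dim=-1))
            speaker_embs.append(
                self.av_proj(h_av).view(B, T, Fdim, self.emb_dim))
        # Second-stage fusion: concatenate all speaker-wise AV embeddings
        # with the audio-only embedding along the embedding dimension.
        final = torch.cat(speaker_embs + [audio_emb], dim=-1)
        return F.normalize(final, dim=-1)  # unit-norm embedding per T-F bin

# Example: a 2-speaker mixture, 100 frames, 129 frequency bins.
model = AVDCSketch()
mix = torch.randn(1, 100, 129)
vis = [torch.randn(1, 100, 256) for _ in range(2)]
emb = model(mix, vis)  # (1, 100, 129, 60): two AV embeddings + one audio embedding
```

At inference, as in standard deep clustering, the final per-bin embeddings would be clustered (e.g., with K-means, one cluster per speaker) and each cluster converted into a binary T–F mask applied to the mixture spectrogram to recover the individual voices.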
