Joint separation and denoising of noisy multi-talker speech using recurrent neural networks and permutation invariant training

In this paper we propose to use utterance-level Permutation Invariant Training (uPIT) for simultaneous speaker-independent multi-talker speech separation and denoising. Specifically, we train deep bi-directional Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNNs) using uPIT for single-channel speaker-independent multi-talker speech separation in multiple noisy conditions, including both synthetic and real-life noise signals. Our experiments focus on the generalizability and noise robustness of models that rely on varying degrees of a priori knowledge, e.g., about the noise type and the number of simultaneous speakers. We show that deep bi-directional LSTM RNNs trained with uPIT in noisy environments can improve both the Signal-to-Distortion Ratio (SDR) and the Extended Short-Time Objective Intelligibility (ESTOI) measure on the speaker-independent multi-talker speech separation and denoising task, across various noise types and Signal-to-Noise Ratios (SNRs). Specifically, we first show that LSTM RNNs can achieve large SDR and ESTOI improvements when evaluated on known noise types, and that a single model can handle multiple noise types with only a slight decrease in performance. Furthermore, we show that a single LSTM RNN can handle both two-speaker and three-speaker noisy mixtures without a priori knowledge about the exact number of speakers. Finally, we show that LSTM RNNs trained using uPIT generalize well to noise types not seen during training.
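As background for the training criterion used throughout the paper, the snippet below sketches the core idea of the uPIT objective: the separation loss is minimized over all output-target permutations, and the chosen permutation is held fixed for the entire utterance rather than per frame. This is a minimal illustration only; the plain MSE criterion, the NumPy implementation, and all names and shapes are our own assumptions, not code or the exact cost function from the paper.

```python
# Minimal sketch of an utterance-level permutation invariant (uPIT-style)
# loss. Illustrative only: shapes, names, and the MSE criterion are assumed.
import itertools
import numpy as np

def upit_mse_loss(estimates, targets):
    """Utterance-level permutation invariant MSE.

    estimates, targets: arrays of shape (S, T, F) --
    S sources, T time frames, F frequency bins.
    Returns the MSE under the best output-target permutation,
    with that permutation kept fixed for the whole utterance.
    """
    S = estimates.shape[0]
    best = np.inf
    for perm in itertools.permutations(range(S)):
        # Mean squared error with the outputs reordered by `perm`
        loss = np.mean((estimates[list(perm)] - targets) ** 2)
        best = min(best, loss)
    return best

# Toy usage: two 100-frame, 129-bin "spectrograms"
rng = np.random.default_rng(0)
est = rng.standard_normal((2, 100, 129))
tgt = est[::-1]  # targets given in swapped order
print(upit_mse_loss(est, tgt))  # ~0: the swapped permutation is optimal
```

For the two- and three-speaker mixtures considered in this paper, the permutation search is cheap (2 or 6 permutations per utterance), so the exhaustive loop above is not a practical bottleneck.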
