Recurrent Neural Networks for Cochannel Speech Separation in Reverberant Environments

Speech separation is a fundamental problem in speech and signal processing. A particularly challenging case is monaural separation of cochannel speech, i.e., a two-talker mixture, in a reverberant environment. In this paper, we study recurrent neural networks (RNNs) with long short-term memory (LSTM) for separating and enhancing speech signals in reverberant cochannel mixtures. Our investigation shows that RNNs are effective at separating reverberant speech signals and that they significantly outperform deep feedforward networks in terms of objective speech intelligibility and quality measures. We also find that, among the training targets compared, the ideal ratio mask (IRM) yields the best performance. Although trained on reverberant signals generated with simulated room impulse responses (RIRs), our model generalizes well to signals generated with recorded RIRs.
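
The IRM training target mentioned above has a standard definition in masking-based separation. The sketch below is a minimal illustration, not the authors' implementation: it computes the common square-root form of the IRM from parallel target and interferer signals via an STFT. The function name, sampling rate, and frame length are assumptions chosen for the example.

```python
import numpy as np
from scipy.signal import stft

def ideal_ratio_mask(target, interferer, fs=16000, nperseg=512):
    """Square-root ideal ratio mask (IRM) for a two-talker mixture.

    IRM(t, f) = sqrt(|S(t, f)|^2 / (|S(t, f)|^2 + |N(t, f)|^2)),
    where S and N are STFTs of the target and interfering talker.
    Names and defaults are illustrative, not the paper's exact setup.
    """
    _, _, S = stft(target, fs=fs, nperseg=nperseg)      # target spectrogram
    _, _, N = stft(interferer, fs=fs, nperseg=nperseg)  # interferer spectrogram
    s_pow = np.abs(S) ** 2
    n_pow = np.abs(N) ** 2
    return np.sqrt(s_pow / (s_pow + n_pow + 1e-10))     # values in [0, 1]
```

Similarly, an LSTM-based mask estimator of the kind the paper studies might be sketched as follows. Layer sizes, feature dimension, and the two-talker output layout are assumptions for illustration, not the paper's configuration; training would typically minimize the mean squared error between estimated and ideal masks.

```python
import torch
import torch.nn as nn

class LSTMMaskEstimator(nn.Module):
    """Maps mixture features (batch, frames, n_feat) to two mask estimates."""

    def __init__(self, n_feat=161, hidden=512, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(n_feat, hidden, num_layers=num_layers,
                            batch_first=True)
        self.proj = nn.Linear(hidden, 2 * n_feat)  # one mask per talker

    def forward(self, x):
        h, _ = self.lstm(x)                  # recurrent encoding over frames
        masks = torch.sigmoid(self.proj(h))  # constrain masks to [0, 1]
        return masks.chunk(2, dim=-1)        # (mask_talker1, mask_talker2)
```

At test time, an estimated mask would be applied to the mixture's magnitude spectrogram and the enhanced signal resynthesized with the mixture phase, a common choice in masking-based separation.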
