Phase and reverberation aware DNN for distant-talking speech enhancement

Enhancing reverberant speech with Deep Neural Networks (DNNs) is an interesting yet challenging topic. The performance of speech enhancement degrades significantly when training and test conditions are mismatched. In this paper, we propose Static Reverberation Aware Training (SRAT)-based dereverberation, in which the reverberation estimate is obtained by averaging over the decomposed frames. This significantly reduces the input dimensionality of the DNN and enables it to learn the relationship between clean and reverberant speech more efficiently. Most speech enhancement approaches ignore phase information because of its complicated structure. Since phase correlates closely with the speech signal, we exploited this relationship to achieve better performance with a DNN: phase information was augmented with magnitude information and used as the DNN input. We denote this method as the phase-aware DNN. Finally, both the phase information and the reverberation estimate were appended to the reverberant speech features to further improve enhancement performance in distant-talking conditions. Features of the reverberant speech, the phase, and the reverberation estimate were used during both the training and testing stages, so that the DNN could exploit reverberation and phase information to generalize better to the speech signal. The proposed method was evaluated on the REVERB Challenge 2014 database. The results show significant improvements in both reconstructed speech quality (PESQ: Perceptual Evaluation of Speech Quality) and the influence of reverberation (SRMR: Speech to Reverberation Modulation Energy Ratio). Compared with the conventional DNN-based approach, the proposed method improved SRMR from 4.84 to 5.92 and PESQ from 2.34 to 2.70, indicating that it can efficiently enhance speech severely corrupted by reverberation.
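
The following is a minimal sketch (not the authors' code) of how the DNN input described above might be constructed: per-frame log-magnitude spectra, phase features, and a static reverberation estimate obtained by averaging over the frames of an utterance. The frame length, hop size, Hann window, and the use of NumPy's FFT are illustrative assumptions.

import numpy as np

def stft_frames(signal, frame_len=512, hop=256):
    # Split the signal into overlapping frames, apply a Hann window, and take the FFT.
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)            # shape: (n_frames, frame_len // 2 + 1)

def build_dnn_input(reverberant_signal):
    spec = stft_frames(reverberant_signal)
    log_mag = np.log(np.abs(spec) + 1e-8)         # magnitude features
    phase = np.angle(spec)                        # phase features (phase-aware input)
    # Static reverberation estimate: average the log-magnitude over all frames of the
    # utterance, then repeat the single averaged vector for every frame; one vector per
    # utterance keeps the added input dimensionality small.
    reverb_est = np.tile(log_mag.mean(axis=0), (log_mag.shape[0], 1))
    # Concatenate magnitude, phase, and reverberation estimate per frame.
    return np.concatenate([log_mag, phase, reverb_est], axis=1)

# Example: features for one second of a 16 kHz reverberant recording.
x = np.random.randn(16000)
features = build_dnn_input(x)
print(features.shape)                             # (n_frames, 3 * (frame_len // 2 + 1))

In an actual system, these frame-level feature vectors would be fed to the enhancement DNN during both training and testing, with clean-speech spectra as the regression targets.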
