A Novel Training Strategy Using Dynamic Data Generation for Deep Neural Network Based Speech Enhancement

In this paper, a new training strategy is proposed to address a key issue in deep neural network (DNN) based speech enhancement: how to make effective use of limited data, given the growing need for large training sets in the deep learning era. Traditionally, a fixed training set consisting of a large number of paired utterances, i.e., clean speech and the corresponding noisy speech, must be prepared in advance. However, enlarging the noisy speech data used in training seems inevitable if the model is to adapt to diverse noise environments, and involving more training data leads to longer training time, since the fixed training set must be repeated for multiple epochs. In this study, we propose a novel training method based on dynamic data generation: the synthesis of noisy speech is performed on the fly, moving from the utterance level to the batch level. This new training method offers three advantages. First, because training batches are generated dynamically, there is no need to prepare and store a fixed training set as in the conventional method. Second, within the same training time as the conventional method, a far greater variety of noisy data is actually fed into the DNN model. Finally, different evaluation measures, including PESQ, STOI, LSD, and SegSNR, are consistently improved on unseen noise types, demonstrating the better generalization capability of the proposed training strategy.
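The paper describes the batch-level mixing procedure only in prose. As a minimal sketch of what on-the-fly noisy-speech synthesis could look like, the following Python generator draws a random clean utterance, a random noise segment, and a random SNR for every example in a batch. The function names (mix_at_snr, dynamic_batch_generator), the SNR range, and all other details are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so the clean/noise power ratio matches snr_db, then mix."""
    # Align the noise to the clean utterance by tiling and random cropping.
    if len(noise) < len(clean):
        reps = int(np.ceil(len(clean) / len(noise)))
        noise = np.tile(noise, reps)
    start = np.random.randint(0, len(noise) - len(clean) + 1)
    noise = noise[start:start + len(clean)]

    clean_power = np.mean(clean ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Solve SNR(dB) = 10*log10(clean_power / (scale^2 * noise_power)) for scale.
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

def dynamic_batch_generator(clean_utts, noises, batch_size, snr_range=(-5, 20)):
    """Yield (noisy, clean) training pairs synthesized on the fly, per batch."""
    while True:
        noisy_batch, clean_batch = [], []
        for _ in range(batch_size):
            clean = clean_utts[np.random.randint(len(clean_utts))]
            noise = noises[np.random.randint(len(noises))]
            snr_db = np.random.uniform(*snr_range)
            noisy_batch.append(mix_at_snr(clean, noise, snr_db))
            clean_batch.append(clean)
        yield noisy_batch, clean_batch
```

Because each batch is mixed fresh from random (clean, noise, SNR) combinations, such a generator would replace a fixed precomputed set of noisy/clean pairs, so repeated epochs expose the model to new noisy realizations rather than the same stored ones.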
