Dynamic noise aware training for speech enhancement based on deep neural networks

We propose three algorithms to address the mismatch problem in deep neural network (DNN) based speech enhancement. First, we investigate noise aware training, which incorporates noise information from the test utterance through an ideal binary mask based dynamic noise estimation approach, to improve the DNN’s ability to separate speech from the noisy signal. Next, a set of more than 100 noise types is adopted to improve the DNN’s generalization to unseen and non-stationary noise conditions. Finally, the quality of the enhanced speech is further improved by global variance equalization. Empirical results show that each of the three proposed techniques contributes to the performance improvement. Compared to the conventional logarithmic minimum mean squared error speech enhancement method, our DNN system achieves an improvement of 0.32 in PESQ (perceptual evaluation of speech quality) across six signal-to-noise ratio levels ranging from -5 dB to 20 dB on a test set with unknown noise types. We also observe that the combined strategies suppress highly non-stationary noise better than all the competing state-of-the-art techniques we have evaluated.

Index Terms: speech enhancement, deep neural networks, noise aware training, ideal binary mask, non-stationary noise
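As a rough illustration of two ideas named in the abstract, the following sketch computes an ideal binary mask, derives a per-bin noise estimate from the noise-dominated time-frequency units, and equalizes the global variance of an enhanced feature matrix. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the 0 dB local criterion, the (frames, bins) array layout, the averaging rule for the noise estimate, and the per-dimension square-root scaling are all illustrative choices. The mask here is computed from known speech and noise power; at test time the clean speech is unavailable, so the mask would have to be predicted, and that step is not shown.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero and log(0)


def ideal_binary_mask(speech_pow, noise_pow, lc_db=0.0):
    """Return 1 for speech-dominated units (local SNR > lc_db), else 0.

    Inputs are power spectrograms of shape (frames, bins).
    The 0 dB threshold is an assumed default.
    """
    local_snr_db = 10.0 * np.log10((speech_pow + EPS) / (noise_pow + EPS))
    return (local_snr_db > lc_db).astype(np.float64)


def masked_noise_estimate(noisy_pow, mask):
    """Average noisy power over noise-dominated units in each frequency bin.

    Assumed rule: bins with no noise-dominated unit fall back to the
    plain mean of the noisy power in that bin.
    """
    noise_units = 1.0 - mask
    counts = noise_units.sum(axis=0)
    sums = (noisy_pow * noise_units).sum(axis=0)
    fallback = noisy_pow.mean(axis=0)
    return np.where(counts > 0, sums / np.maximum(counts, 1.0), fallback)


def gv_equalize(enhanced, ref_gv):
    """Rescale each feature dimension so its variance matches ref_gv.

    `enhanced` has shape (frames, dims); `ref_gv` has shape (dims,) and
    would typically be measured on clean reference features.
    """
    mean = enhanced.mean(axis=0)
    est_gv = enhanced.var(axis=0)
    scale = np.sqrt(ref_gv / np.maximum(est_gv, EPS))
    return (enhanced - mean) * scale + mean


# Toy example: 2 frames x 2 bins.
speech_pow = np.array([[4.0, 0.1], [0.1, 4.0]])
noise_pow = np.ones((2, 2))
noisy_pow = speech_pow + noise_pow  # assumes uncorrelated speech and noise

mask = ideal_binary_mask(speech_pow, noise_pow)
noise_est = masked_noise_estimate(noisy_pow, mask)

enhanced = np.array([[0.0, 0.0], [2.0, 4.0]])
ref_gv = np.array([4.0, 4.0])
equalized = gv_equalize(enhanced, ref_gv)
```

In a noise aware training setup, a vector such as `noise_est` would be appended to the network's input features, so that the network sees an explicit description of the noise alongside the noisy speech frames.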
