A Regression Approach to Speech Enhancement Based on Deep Neural Networks

In contrast to the conventional minimum mean square error (MMSE)-based noise reduction techniques, we propose a supervised method to enhance speech by means of finding a mapping function between noisy and clean speech signals based on deep neural networks (DNNs). In order to be able to handle a wide range of additive noises in real-world situations, a large training set that encompasses many possible combinations of speech and noise types, is first designed. A DNN architecture is then employed as a nonlinear regression function to ensure a powerful modeling capability. Several techniques have also been proposed to improve the DNN-based speech enhancement system, including global variance equalization to alleviate the over-smoothing problem of the regression model, and the dropout and noise-aware training strategies to further improve the generalization capability of DNNs to unseen noise conditions. Experimental results demonstrate that the proposed framework can achieve significant improvements in both objective and subjective measures over the conventional MMSE based technique. It is also interesting to observe that the proposed DNN approach can well suppress highly nonstationary noise, which is tough to handle in general. Furthermore, the resulting DNN model, trained with artificial synthesized data, is also effective in dealing with noisy speech data recorded in real-world scenarios without the generation of the annoying musical artifact commonly observed in conventional enhancement methods.

[1]  Guy J. Brown,et al.  Computational Auditory Scene Analysis: Principles, Algorithms, and Applications , 2006 .

[2]  Li-Rong Dai,et al.  Voice Conversion Using Deep Neural Networks With Layer-Wise Generative Training , 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[3]  Pascal Scalart,et al.  Speech enhancement based on a priori signal to noise estimation , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[4]  Israel Cohen,et al.  Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging , 2003, IEEE Trans. Speech Audio Process..

[5]  Yoshua Bengio,et al.  Exploring Strategies for Training Deep Neural Networks , 2009, J. Mach. Learn. Res..

[6]  Alex T. NELSONOregon Networks for Speech Enhancement , 1998 .

[7]  Philipos C. Loizou,et al.  A noise-estimation algorithm for highly non-stationary environments , 2006, Speech Commun..

[8]  José L. Pérez-Córdoba,et al.  Histogram equalization of speech representation for robust speech recognition , 2005, IEEE Transactions on Speech and Audio Processing.

[9]  Philipos C. Loizou,et al.  Speech Enhancement: Theory and Practice , 2007 .

[10]  Yongqiang Wang,et al.  An investigation of deep neural networks for noise robust speech recognition , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[11]  Jesper Jensen,et al.  An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[12]  R. Plackett,et al.  THE DESIGN OF OPTIMUM MULTIFACTORIAL EXPERIMENTS , 1946 .

[13]  J. I The Design of Experiments , 1936, Nature.

[14]  Olivier Cappé,et al.  Elimination of the musical noise phenomenon with the Ephraim and Malah noise suppressor , 1994, IEEE Trans. Speech Audio Process..

[15]  Herman J. M. Steeneken,et al.  Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems , 1993, Speech Commun..

[16]  Yoshua. Bengio,et al.  Learning Deep Architectures for AI , 2007, Found. Trends Mach. Learn..

[17]  Björn W. Schuller,et al.  Feature enhancement by bidirectional LSTM networks for conversational speech recognition in highly non-stationary noise , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[18]  DeLiang Wang,et al.  Towards Scaling Up Classification-Based Speech Separation , 2013, IEEE Transactions on Audio, Speech, and Language Processing.

[19]  S. Boll,et al.  Suppression of acoustic noise in speech using spectral subtraction , 1979 .

[20]  DeLiang Wang,et al.  Joint noise adaptive training for robust automatic speech recognition , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[21]  Francesco Piazza,et al.  Nonlinear Speech Enhancement: An Overview , 2005, WNSP.

[22]  S. Tamura,et al.  An analysis of a noise reduction neural network , 1989, International Conference on Acoustics, Speech, and Signal Processing,.

[23]  Birger Kollmeier,et al.  SNR estimation based on amplitude modulation analysis with applications to noise suppression , 2003, IEEE Trans. Speech Audio Process..

[24]  DeLiang Wang,et al.  An algorithm to improve speech recognition in noise for hearing-impaired listeners. , 2013, The Journal of the Acoustical Society of America.

[25]  DeLiang Wang,et al.  Ideal ratio mask estimation using deep neural networks for robust speech recognition , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[26]  Parham Aarabi,et al.  On the importance of phase in human speech recognition , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[27]  DeLiang Wang,et al.  Investigation of Speech Separation as a Front-End for Noise Robust Speech Recognition , 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[28]  Keiichi Tokuda,et al.  Spectral conversion based on maximum likelihood estimation considering global variance of converted parameter , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[29]  Changchun Bao,et al.  Speech enhancement with weighted denoising auto-encoder , 2013, INTERSPEECH.

[30]  Ephraim Speech enhancement using a minimum mean square error short-time spectral amplitude estimator , 1984 .

[31]  Yee Whye Teh,et al.  A Fast Learning Algorithm for Deep Belief Nets , 2006, Neural Computation.

[32]  Yi Hu,et al.  A noise estimation algorithm with rapid adaptation for highly nonstationary environments , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[33]  Jun Du,et al.  An Experimental Study on Speech Enhancement Based on Deep Neural Networks , 2014, IEEE Signal Processing Letters.

[34]  Geoffrey E. Hinton,et al.  Binary coding of speech spectrograms using a deep auto-encoder , 2010, INTERSPEECH.

[35]  Sriram Srinivasan,et al.  Knowledge-Based Speech Enhancement , 2005 .

[36]  Dong Yu,et al.  Modeling Spectral Envelopes Using Restricted Boltzmann Machines and Deep Belief Networks for Statistical Parametric Speech Synthesis , 2013, IEEE Transactions on Audio, Speech, and Language Processing.

[37]  DeLiang Wang,et al.  Boosted deep neural networks and multi-resolution cochleagram features for voice activity detection , 2014, INTERSPEECH.

[38]  Jun Du,et al.  A speech enhancement approach using piecewise linear approximation of an explicit model of environmental distortions , 2008, INTERSPEECH.

[39]  David Pearce,et al.  The aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions , 2000, INTERSPEECH.

[40]  Quoc V. Le,et al.  Recurrent Neural Networks for Noise Reduction in Robust ASR , 2012, INTERSPEECH.

[41]  A.V. Oppenheim,et al.  Enhancement and bandwidth compression of noisy speech , 1979, Proceedings of the IEEE.

[42]  Geoffrey E. Hinton,et al.  Reducing the Dimensionality of Data with Neural Networks , 2006, Science.

[43]  DeLiang Wang,et al.  On Training Targets for Supervised Speech Separation , 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[44]  Nitish Srivastava,et al.  Improving neural networks by preventing co-adaptation of feature detectors , 2012, ArXiv.

[45]  Hui Jiang,et al.  Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[46]  Ning Ma,et al.  The CHiME corpus: a resource and a challenge for computational hearing in multisource environments , 2010, INTERSPEECH.

[47]  Israel Cohen,et al.  Speech enhancement for non-stationary noise environments , 2001, Signal Process..

[48]  Jacob Benesty,et al.  Speech Enhancement , 2010 .

[49]  Dirk Van Compernolle,et al.  A family of MLP based nonlinear spectral estimators for noise reduction , 1994, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing.

[50]  Yoshua Bengio,et al.  Why Does Unsupervised Pre-training Help Deep Learning? , 2010, AISTATS.