Perceptual improvement of deep neural networks for monaural speech enhancement

Monaural speech enhancement is a key yet challenging problem for many important real-world applications. Recently, deep neural network (DNN)-based speech enhancement methods, which extract useful features from complex input features, have demonstrated remarkable performance improvements. In this paper, we present a novel DNN architecture for monaural speech enhancement. Taking into account the masking properties of the human auditory system, the proposed architecture applies a piecewise gain function that reduces the noise and makes the residual noise perceptually inaudible. The piecewise gain function and the DNN are jointly optimized. Systematic experiments on the TIMIT corpus with 20 noise types at various signal-to-noise ratio (SNR) conditions demonstrate the superiority of the proposed DNN over the reference speech enhancement methods in both matched and unmatched noise conditions.
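To make the idea of a masking-aware piecewise gain concrete, the following is a minimal sketch and not the formulation used in the paper, which the abstract does not specify. It assumes a per-bin gain that leaves a time-frequency bin untouched when the estimated noise power already lies below the auditory masking threshold, and otherwise falls back to a Wiener-like gain. The function name `piecewise_gain`, its parameters, and the `floor` value are all hypothetical, and the joint optimization with the DNN is not shown.

```python
import numpy as np


def piecewise_gain(noisy_power, noise_power, masking_threshold, floor=0.05):
    """Hypothetical piecewise spectral gain per time-frequency bin.

    Assumption (not from the paper): gain is 1 where the noise is already
    masked, and a floored Wiener-like gain elsewhere.
    """
    noisy_power = np.asarray(noisy_power, dtype=float)
    noise_power = np.asarray(noise_power, dtype=float)
    masking_threshold = np.asarray(masking_threshold, dtype=float)
    eps = 1e-12  # avoids division by zero in silent bins

    # Estimate speech power by subtracting noise power, clipped at zero.
    speech_power = np.maximum(noisy_power - noise_power, 0.0)

    # Wiener-like gain, floored to limit musical-noise artifacts.
    wiener = speech_power / (speech_power + noise_power + eps)
    wiener = np.maximum(wiener, floor)

    # Piecewise rule: masked noise is inaudible, so do not attenuate the bin.
    return np.where(noise_power <= masking_threshold, 1.0, wiener)


if __name__ == "__main__":
    gain = piecewise_gain(
        noisy_power=[4.0, 4.0],
        noise_power=[1.0, 1.0],
        masking_threshold=[2.0, 0.5],
    )
    print(gain)
```

In this sketch the first bin keeps a gain of 1 because its noise power is under the masking threshold, while the second bin is attenuated. A learned system would instead produce such gains, or the quantities feeding them, from the network outputs.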
