Perception Optimized Deep Denoising AutoEncoders for Speech Enhancement

Speech enhancement is a challenging and important area of research because many applications depend on improved signal quality. It serves as a pre-processing step in speech processing systems and is also used to perceptually improve speech quality for human listeners. With recent advances in deep neural networks (DNNs), deep denoising auto-encoders have proved very successful for speech enhancement. In this paper, we propose a novel objective loss function that takes into account the perceptual quality of speech, and we use it to train Perceptually Optimized Speech Denoising Auto-Encoders (POS-DAE). We demonstrate the effectiveness of POS-DAE in a speech enhancement task. Further, we introduce a two-stage DNN architecture for denoising and enhancement. We show the effectiveness of the proposed methods on a high-noise subset of the QUT-NOISE-TIMIT database under mismatched noise conditions. Experiments compare the POS-DAE loss against the Mean Square Error (MSE) loss function using three measures: speech distortion, noise reduction, and Perceptual Evaluation of Speech Quality (PESQ). We find that the proposed loss function and the new two-stage architecture give significant improvements in perceptual speech quality measures, and that the improvements become more pronounced under higher noise conditions.
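To make the contrast between a plain MSE loss and a perceptually motivated loss concrete, the following is a minimal sketch only. The abstract does not specify the form of the proposed loss, so the per-frequency-bin weighting used here, the function names `mse_loss` and `weighted_mse_loss`, and the example weights are all illustrative assumptions and should not be read as the POS-DAE loss itself.

```python
import numpy as np


def mse_loss(clean, enhanced):
    """Plain mean square error over all frames and frequency bins."""
    clean = np.asarray(clean, dtype=float)
    enhanced = np.asarray(enhanced, dtype=float)
    return float(np.mean((clean - enhanced) ** 2))


def weighted_mse_loss(clean, enhanced, weights):
    """MSE with a per-frequency-bin weight.

    ASSUMPTION: this weighting is a hypothetical stand-in for a
    perceptual criterion; it is not the loss proposed in the paper.

    clean, enhanced: arrays of shape (frames, bins)
    weights: non-negative array of shape (bins,), rescaled to mean 1
    """
    clean = np.asarray(clean, dtype=float)
    enhanced = np.asarray(enhanced, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.mean()  # keep the loss on the same scale as plain MSE
    return float(np.mean(w * (clean - enhanced) ** 2))


if __name__ == "__main__":
    # Toy spectra: 2 frames, 4 frequency bins, error only in bin 0.
    clean = np.zeros((2, 4))
    enhanced = np.zeros((2, 4))
    enhanced[:, 0] = 1.0

    # Hypothetical weights that emphasize bin 0.
    weights = np.array([4.0, 1.0, 1.0, 1.0])

    print("MSE:         ", mse_loss(clean, enhanced))
    print("Weighted MSE:", weighted_mse_loss(clean, enhanced, weights))
```

With uniform weights the two functions agree; with non-uniform weights, errors in the emphasized bins are penalized more heavily, which is the general idea behind steering a denoising auto-encoder toward perceptually important regions of the spectrum.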
