A Maximum Likelihood Approach to Masking-based Speech Enhancement Using Deep Neural Network

The minimum mean squared error (MMSE) criterion is usually adopted for training deep neural network (DNN) based speech enhancement. In this study, we propose a probabilistic learning framework to optimize the DNN parameters for masking-based speech enhancement. The ideal ratio mask (IRM) is used as the learning target, and the components of its prediction error vector at the DNN output are modeled as statistically independent and following a generalized Gaussian distribution (GGD). Accordingly, we present a maximum likelihood (ML) approach to DNN parameter optimization. We analyze and discuss the effect of the GGD shape parameter on noise reduction and speech preservation. Experimental results on the TIMIT corpus show that the proposed ML-based learning approach achieves consistent improvements over MMSE-based DNN learning on all evaluation metrics. The ML-based approach also yields less speech distortion than the MMSE-based approach, especially for high-frequency units.
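To make the link between the GGD error model and the training criterion concrete, the following is a minimal sketch of the negative log-likelihood that an ML criterion of this kind would minimize. It assumes a zero-mean GGD with density p(e) = beta / (2 * alpha * Gamma(1/beta)) * exp(-(|e| / alpha)^beta), with one scale `alpha` and one shape `beta` shared across all units. The function names, the shared-parameter choice, and the toy mask values are illustrative assumptions, not details taken from the paper.

```python
import math


def ggd_nll(errors, alpha, beta):
    """Average negative log-likelihood of errors under a zero-mean GGD.

    Assumed density (not taken from the paper's text):
        p(e) = beta / (2 * alpha * Gamma(1 / beta)) * exp(-(|e| / alpha) ** beta)
    alpha is the scale parameter, beta is the shape parameter.
    """
    if alpha <= 0 or beta <= 0:
        raise ValueError("alpha and beta must be positive")
    # Log of the normalizing constant, negated.
    log_norm = math.log(2.0 * alpha) + math.lgamma(1.0 / beta) - math.log(beta)
    # Data-dependent term, averaged over all time-frequency units.
    data_term = sum((abs(e) / alpha) ** beta for e in errors) / len(errors)
    return log_norm + data_term


def ml_scale(errors, beta):
    """Closed-form ML estimate of alpha for a fixed beta.

    Setting the derivative of ggd_nll with respect to alpha to zero gives
    alpha ** beta = beta * mean(|e| ** beta).
    """
    mean_pow = sum(abs(e) ** beta for e in errors) / len(errors)
    return (beta * mean_pow) ** (1.0 / beta)


# Hypothetical IRM targets and DNN outputs for a handful of units.
target_mask = [0.90, 0.10, 0.55, 0.70]
predicted_mask = [0.80, 0.30, 0.50, 0.40]
errors = [t - p for t, p in zip(target_mask, predicted_mask)]

for beta in (1.0, 2.0):
    alpha = ml_scale(errors, beta)
    print(f"beta={beta}: alpha={alpha:.4f}, nll={ggd_nll(errors, alpha, beta):.4f}")
```

With `beta = 2` the GGD reduces to a Gaussian, so for a fixed scale the data term is proportional to the squared error and minimizing it is equivalent to MMSE training. With `beta = 1` it reduces to a Laplacian, whose data term is the absolute error. Changing the shape parameter therefore changes how strongly large errors are penalized.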
