Using Separate Losses for Speech and Noise in Mask-Based Speech Enhancement

Estimating time-frequency domain masks for speech enhancement with deep learning approaches has recently become a popular field of research. In this paper, we propose a novel components loss (CL) for the training of neural networks for speech enhancement. During training, the proposed CL offers separate control over the suppression of the noise component and the preservation of the speech component. We illustrate the potential of the proposed CL by the example of a convolutional neural network (CNN) for mask-based speech enhancement. We show improvement in almost all employed instrumental quality metrics over the baseline losses, which comprise the conventional mean squared error (MSE) loss as well as the perceptual evaluation of speech quality (PESQ) loss. On average, we obtain an SNR improvement that is more than 0.3 dB higher and a PESQ score on the speech component that is at least 0.1 points higher. In addition, the residual noise sounds more natural, and the PESQ on the enhanced speech is consistently the best. All improvements are more pronounced in low SNR conditions.
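To make the idea concrete, the following is a minimal sketch (assuming PyTorch) of how such a components loss could be formed: the estimated mask is applied separately to the clean speech and the noise spectra, yielding a speech preservation term and a noise suppression term that are traded off by a weight. The weight alpha and the exact component definitions here are illustrative assumptions, not necessarily the paper's exact formulation.

    import torch

    def components_loss(mask, speech_spec, noise_spec, alpha=0.5):
        # mask:        estimated time-frequency mask, shape (batch, freq, frames)
        # speech_spec: clean speech magnitude spectrogram, same shape
        # noise_spec:  noise magnitude spectrogram, same shape
        # alpha:       hypothetical weight trading speech preservation
        #              against noise suppression (illustrative, not from the paper)

        # Speech component: the mask applied to the clean speech should
        # leave the speech intact (target: the unaltered speech).
        speech_term = torch.mean((mask * speech_spec - speech_spec) ** 2)

        # Noise component: the mask applied to the noise should
        # suppress it (target: zero).
        noise_term = torch.mean((mask * noise_spec) ** 2)

        return alpha * speech_term + (1.0 - alpha) * noise_term

In such a sketch, setting alpha closer to 1 would emphasize preservation of the speech component, while a smaller alpha would emphasize noise suppression; this trade-off is the separate control over the two components that the CL is designed to provide.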
