A Deep Learning Loss Function Based on the Perceptual Evaluation of the Speech Quality

This letter proposes a perceptual metric for speech quality evaluation that is suitable as a loss function for training deep learning methods. The metric, derived from the perceptual evaluation of speech quality (PESQ) algorithm, is computed on a per-frame basis from the power spectra of the reference and processed speech signals. Two disturbance terms, which account for distortion once auditory masking and threshold effects are factored in, amend the mean square error (MSE) loss function by introducing perceptual criteria based on human psychoacoustics. The proposed loss function is evaluated for noisy speech enhancement with deep neural networks. Experimental results show that the metric achieves significant gains in speech quality (evaluated using an objective metric and a listening test) compared with MSE and with other perceptual loss functions from the literature.
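To make the structure of such a loss concrete, the following is a minimal sketch of a per-frame loss that adds two disturbance-style penalties to the MSE between power spectra. It is an illustration only and not the formulation from the letter: the function name `per_frame_loss`, the weights `alpha` and `beta`, the dead-zone factor `mask_ratio`, and the exact form of both penalty terms are assumptions chosen for simplicity. In particular, the sketch operates directly on power spectra and omits the auditory (Bark-scale and loudness) transformations that a PESQ-derived metric would apply.

```python
import numpy as np


def per_frame_loss(ref_power, proc_power, alpha=0.1, beta=0.1, mask_ratio=0.25):
    """Illustrative per-frame loss: MSE plus two disturbance-style penalties.

    ref_power, proc_power: arrays of shape (frames, bins) holding the power
    spectra of the reference and processed signals.
    alpha, beta, mask_ratio: hypothetical constants, not taken from the letter.
    Returns one loss value per frame.
    """
    ref = np.asarray(ref_power, dtype=float)
    proc = np.asarray(proc_power, dtype=float)
    diff = proc - ref

    # Plain MSE term, averaged over frequency bins for each frame.
    mse = np.mean(diff ** 2, axis=-1)

    # Threshold/masking effect (simplified): differences smaller than a
    # dead zone tied to the weaker of the two spectra count as inaudible.
    dead_zone = mask_ratio * np.minimum(ref, proc)
    audible = np.sign(diff) * np.maximum(np.abs(diff) - dead_zone, 0.0)

    # First penalty: treats added and removed energy alike.
    d_sym = np.sqrt(np.mean(audible ** 2, axis=-1))

    # Second penalty: counts only added energy (assumed to be more annoying
    # than removed energy in this sketch).
    d_asym = np.mean(np.maximum(audible, 0.0), axis=-1)

    return mse + alpha * d_sym + beta * d_asym


if __name__ == "__main__":
    ref = np.array([[1.0, 1.0]])
    added = np.array([[2.0, 1.0]])    # energy added to the first bin
    print(per_frame_loss(ref, added))
```

With both weights set to zero the function reduces to the per-frame MSE, so the two extra terms can be read as perceptually motivated corrections on top of a standard training objective. For actual training, the same arithmetic would need to be written with a differentiable framework rather than NumPy.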
