Training Supervised Speech Separation System to Improve STOI and PESQ Directly

Supervised speech separation methods train learning machine to cast the noisy speech to the target clean speech. Most of them use mean-square error (MSE) as loss function. However, MSE is not the perfect choice because it doesn't match the human auditory perception. Short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ) are closely related to the human auditory perception and widely used in speech separation research as evaluation criteria. Therefore, STOI and PESQ may be better choices for the loss function. However, they are nondifferentiable functions which cannot be optimized by the conventional gradient descent algorithm. In this work, a gradient approximation method is used to calculate the gradients of the STOI and PESQ. Then the calculated gradients are used in the gradient descent algorithm to optimize the STOI and PESQ directly. Experimental results show the speech separation performance can be improved by the proposed method.

[1]  Yi Hu,et al.  Evaluation of Objective Quality Measures for Speech Enhancement , 2008, IEEE Transactions on Audio, Speech, and Language Processing.

[2]  Frank Sehnke,et al.  Parameter-exploring policy gradients , 2010, Neural Networks.

[3]  J. Spall Multivariate stochastic approximation using a simultaneous perturbation gradient approximation , 1992 .

[4]  DeLiang Wang,et al.  On Training Targets for Supervised Speech Separation , 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[5]  Nam Soo Kim,et al.  DNN-based monaural speech enhancement with temporal and spectral variations equalization , 2018, Digit. Signal Process..

[6]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[7]  DeLiang Wang,et al.  Supervised Speech Separation Based on Deep Learning: An Overview , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[8]  David Malah,et al.  Speech enhancement using a minimum mean-square error log-spectral amplitude estimator , 1984, IEEE Trans. Acoust. Speech Signal Process..

[9]  Herman J. M. Steeneken,et al.  Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems , 1993, Speech Commun..

[10]  Hui Zhang,et al.  A Pairwise Algorithm Using the Deep Stacking Network for Speech Separation and Pitch Estimation , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[11]  Yurii Nesterov,et al.  Random Gradient-Free Minimization of Convex Functions , 2015, Foundations of Computational Mathematics.

[12]  Andries P. Hekstra,et al.  Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[13]  Jesper Jensen,et al.  An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[14]  Paris Smaragdis,et al.  Joint Optimization of Masks and Deep Recurrent Neural Networks for Monaural Source Separation , 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[15]  Jonathan G. Fiscus,et al.  DARPA TIMIT:: acoustic-phonetic continuous speech corpus CD-ROM, NIST speech disc 1-1.1 , 1993 .

[16]  Jürgen Schmidhuber,et al.  Highway Networks , 2015, ArXiv.

[17]  Changchun Bao,et al.  Wiener filtering based speech enhancement with Weighted Denoising Auto-encoder and noise classification , 2014, Speech Commun..

[18]  Panayiotis G. Georgiou,et al.  Perception Optimized Deep Denoising AutoEncoders for Speech Enhancement , 2016, INTERSPEECH.

[19]  Philipos C. Loizou,et al.  Speech Enhancement: Theory and Practice , 2007 .

[20]  강태균 Deep Learning Approach for Robust Voice Activity Detection and Speech Enhancement , 2017 .

[21]  DeLiang Wang,et al.  Exploring Monaural Features for Classification-Based Speech Segregation , 2013, IEEE Transactions on Audio, Speech, and Language Processing.

[22]  DeLiang Wang,et al.  A Supervised Learning Approach to Monaural Segregation of Reverberant Speech , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[23]  Li-Rong Dai,et al.  A Regression Approach to Speech Enhancement Based on Deep Neural Networks , 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.