End-to-End Multi-Task Denoising for Joint SDR and PESQ Optimization

Supervised learning based on deep neural networks has recently achieved substantial improvements in speech enhancement. Denoising networks learn a mapping either directly from noisy speech to clean speech, or to a spectral mask, i.e., the ratio between the clean and noisy spectra. In either case, the network is optimized by minimizing the mean square error (MSE) between ground-truth labels and the time-domain or spectral output. However, existing schemes suffer from one of two critical issues: spectrum mismatch or metric mismatch. Spectrum mismatch is the well-known problem that a spectrum modified after the short-time Fourier transform (STFT) cannot, in general, be fully recovered after the inverse STFT (ISTFT). Metric mismatch refers to the fact that the conventional MSE criterion is sub-optimal for maximizing the target metrics, signal-to-distortion ratio (SDR) and perceptual evaluation of speech quality (PESQ). This paper presents a new end-to-end denoising framework that jointly optimizes SDR and PESQ. First, network optimization is performed on the time-domain signal after ISTFT, which avoids spectrum mismatch. Second, two loss functions with improved correlation to the SDR and PESQ metrics are proposed to minimize metric mismatch. Experimental results show that the proposed denoising scheme significantly improves both SDR and PESQ over existing methods.
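To make the two mismatches concrete, the sketch below (not the paper's implementation) evaluates an SDR-style loss on the time-domain waveform reconstructed by ISTFT, rather than an MSE on the spectrum. The STFT parameters, the placeholder mask, and the `sdr_loss` helper are all illustrative assumptions; in an actual system the mask would be produced by a trainable network and the loss backpropagated through the ISTFT.

```python
import numpy as np
from scipy.signal import stft, istft

def sdr_loss(clean, estimate, eps=1e-8):
    """Negative SDR in dB; minimizing this loss maximizes SDR."""
    num = np.sum(clean ** 2)
    den = np.sum((clean - estimate) ** 2) + eps
    return -10.0 * np.log10(num / den + eps)

# Stand-in signals for a noisy utterance and its clean reference.
fs = 16000
rng = np.random.default_rng(0)
clean = rng.standard_normal(fs)
noisy = clean + 0.3 * rng.standard_normal(fs)

# Mask the noisy spectrum (placeholder mask; a network would predict it),
# then reconstruct the waveform with ISTFT so the loss is computed on the
# same time-domain signal a listener would hear -- no spectrum mismatch.
_, _, spec = stft(noisy, fs=fs, nperseg=512)
mask = np.abs(spec) / (np.abs(spec) + 1.0)
_, est = istft(mask * spec, fs=fs, nperseg=512)

n = min(len(clean), len(est))
loss = sdr_loss(clean[:n], est[:n])
print(f"negative-SDR loss: {loss:.2f} dB")
```

Because the loss is a monotone function of SDR itself, gradient descent on this objective directly targets the evaluation metric, unlike a spectral MSE, which can decrease without a corresponding SDR gain.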
