Learning Complex Spectral Mapping for Speech Enhancement with Improved Cross-Corpus Generalization

It is recently revealed that deep learning based speech enhancement systems do not generalize to untrained corpora in low signal-to-noise ratio (SNR) conditions, mainly due to the channel mismatch between trained and untrained corpora. In this study, we investigate techniques to improve cross-corpus generalization of complex spectrogram enhancement. First, we propose a long short-term memory (LSTM) network for complex spectral mapping. Evaluated on untrained noises and corpora, the proposed network substantially outperforms a state-of-theart gated convolutional recurrent network (GCRN). Next, we examine the importance of training corpus for cross-corpus generalization. It is found that a training corpus that contains utterances with different channels can significantly improve performance on untrained corpora. Finally, we observe that using a smaller frame shift in short-time Fourier transform (STFT) is a simple but highly effective technique to improve cross-corpus generalization.

[1]  Kuldip K. Paliwal,et al.  The importance of phase in speech enhancement , 2011, Speech Commun..

[2]  Ke Tan,et al.  Complex Spectral Mapping with a Convolutional Recurrent Network for Monaural Speech Enhancement , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3]  Herman J. M. Steeneken,et al.  Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems , 1993, Speech Commun..

[4]  DeLiang Wang,et al.  Exploring Deep Complex Networks for Complex Spectrogram Enhancement , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5]  IEEE Recommended Practice for Speech Quality Measurements , 1969, IEEE Transactions on Audio and Electroacoustics.

[6]  DeLiang Wang,et al.  On Cross-Corpus Generalization of Deep Learning Based Speech Enhancement , 2020, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[7]  Jung-Woo Ha,et al.  Phase-aware Speech Enhancement with Deep Complex U-Net , 2019, ICLR.

[8]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[9]  DeLiang Wang,et al.  Supervised Speech Separation Based on Deep Learning: An Overview , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[10]  DeLiang Wang,et al.  Learning Complex Spectral Mapping With Gated Convolutional Recurrent Networks for Monaural Speech Enhancement , 2020, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[11]  Nima Mesgarani,et al.  Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation , 2018, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[12]  DeLiang Wang,et al.  Complex Ratio Masking for Monaural Speech Separation , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[13]  Li-Rong Dai,et al.  A Regression Approach to Speech Enhancement Based on Deep Neural Networks , 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[14]  DeLiang Wang,et al.  TCNN: Temporal Convolutional Neural Network for Real-time Speech Enhancement in the Time Domain , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15]  Jae S. Lim,et al.  The unimportance of phase in speech enhancement , 1982 .

[16]  Yu Tsao,et al.  Complex spectrogram enhancement by convolutional neural network with multi-metrics learning , 2017, 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP).

[17]  Janet M. Baker,et al.  The Design for the Wall Street Journal-based CSR Corpus , 1992, HLT.

[18]  Yann Dauphin,et al.  Language Modeling with Gated Convolutional Networks , 2016, ICML.

[19]  DeLiang Wang,et al.  A New Framework for CNN-Based Speech Enhancement in the Time Domain , 2019, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[20]  DeLiang Wang,et al.  Long short-term memory for speaker generalization in supervised speech separation. , 2017, The Journal of the Acoustical Society of America.

[21]  DeLiang Wang,et al.  Gated Residual Networks With Dilated Convolutions for Monaural Speech Enhancement , 2019, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[22]  Andries P. Hekstra,et al.  Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[23]  Jinwon Lee,et al.  A Fully Convolutional Neural Network for Speech Enhancement , 2016, INTERSPEECH.

[24]  DeLiang Wang,et al.  On Adversarial Training and Loss Functions for Speech Enhancement , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[25]  Jonathan G. Fiscus,et al.  DARPA TIMIT:: acoustic-phonetic continuous speech corpus CD-ROM, NIST speech disc 1-1.1 , 1993 .

[26]  Jesper Jensen,et al.  An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[27]  WangDeLiang,et al.  A New Framework for CNN-Based Speech Enhancement in the Time Domain , 2019 .

[28]  DeLiang Wang,et al.  On Training Targets for Supervised Speech Separation , 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[29]  Sanjeev Khudanpur,et al.  Librispeech: An ASR corpus based on public domain audio books , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).