Learning Complex Spectral Mapping With Gated Convolutional Recurrent Networks for Monaural Speech Enhancement

Phase is important for the perceptual quality of speech. However, it seems intractable to directly estimate phase spectra through supervised learning, due to their lack of spectrotemporal structure. Complex spectral mapping aims to estimate the real and imaginary spectrograms of clean speech from those of noisy speech, and thus enhances the magnitude and phase responses of speech simultaneously. Inspired by multi-task learning, we propose a gated convolutional recurrent network (GCRN) for complex spectral mapping, which amounts to a causal system for monaural speech enhancement. Our experimental results suggest that the proposed GCRN substantially outperforms an existing convolutional neural network (CNN) for complex spectral mapping in terms of both objective speech intelligibility and quality. Moreover, the proposed approach yields significantly higher short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ) scores than magnitude spectral mapping and complex ratio masking. We also find that complex spectral mapping with the proposed GCRN provides an effective phase estimate.
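
To make the mapping concrete, below is a minimal NumPy/SciPy sketch of the complex spectral mapping pipeline: the STFT of the noisy waveform is split into real and imaginary channels, an enhancement model predicts the clean real and imaginary spectrograms, and the waveform is reconstructed with the inverse STFT. The `model` callable, the 16 kHz sampling rate, and the 20 ms frame / 10 ms hop settings are illustrative assumptions, not the paper's GCRN architecture or its exact STFT configuration.

```python
import numpy as np
from scipy.signal import stft, istft

def complex_spectral_mapping(noisy, model, fs=16000, nperseg=320, noverlap=160):
    """Enhance a noisy waveform by estimating the real and imaginary
    spectrograms of clean speech from those of noisy speech.

    `model` is any callable mapping a (2, F, T) real/imaginary feature
    stack to an enhanced stack of the same shape (a stand-in for a
    trained network; the GCRN itself is not reproduced here).
    """
    # Complex STFT of the noisy input.
    _, _, Z = stft(noisy, fs=fs, nperseg=nperseg, noverlap=noverlap)

    # Stack real and imaginary parts as two input channels, so that
    # both magnitude and phase information are presented to the model.
    feats = np.stack([Z.real, Z.imag])      # shape: (2, F, T)

    # Predict the real and imaginary spectrograms of clean speech.
    est = model(feats)                      # shape: (2, F, T)

    # Recombine into a complex spectrogram and invert. Unlike
    # magnitude-only mapping, the noisy phase is not reused here;
    # the phase is implicit in the estimated real/imaginary parts.
    Z_hat = est[0] + 1j * est[1]
    _, enhanced = istft(Z_hat, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return enhanced

# Toy usage: an identity "model" returns the noisy spectrograms unchanged,
# so the output approximately reconstructs the input waveform.
if __name__ == "__main__":
    noisy = np.random.randn(16000).astype(np.float32)
    out = complex_spectral_mapping(noisy, model=lambda x: x)
```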
