A Time-domain Monaural Speech Enhancement with Recursive Learning

In this paper, we propose a type of neural network with recursive learning in the time domain called RTNet for monaural speech enhancement, where the proposed network consists of three principal components. The first part is called stage recurrent neural network, which is introduced to effectively aggregate the deep feature dependencies across different stages with a memory mechanism and also remove the interference stage by stage. The second part is the convolutional auto-encoder. The third part consists of a series of concatenated gated linear units, which are capable of facilitating the information flow and gradually increasing the receptive fields. Recursive learning is adopted to improve the parameter efficiency and therefore, the number of trainable parameters is effectively reduced without sacrificing its performance. Numerous experiments are conducted on TIMIT corpus and experimental results demonstrate that the proposed network can achieve consistently better performance in terms of both PESQ and STOI scores than two state-of-the-art time domain-based baselines in different conditions. The code is provided at this https URL Andong-Li-speech/RTNet.

[1]  DeLiang Wang,et al.  Gated Residual Networks With Dilated Convolutions for Monaural Speech Enhancement , 2019, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[2]  Andries P. Hekstra,et al.  Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[3]  Li-Rong Dai,et al.  A Regression Approach to Speech Enhancement Based on Deep Neural Networks , 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[4]  Xiaodong Li,et al.  A cepstrum-based preprocessing and postprocessing for speech enhancement in adverse environments , 2013 .

[5]  Jun Du,et al.  Densely Connected Progressive Learning for LSTM-Based Speech Enhancement , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[6]  Alex Graves,et al.  Conditional Image Generation with PixelCNN Decoders , 2016, NIPS.

[7]  Yu Tsao,et al.  Complex spectrogram enhancement by convolutional neural network with multi-metrics learning , 2017, 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP).

[8]  Seunghoon Hong,et al.  Learning Deconvolution Network for Semantic Segmentation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[9]  Søren Holdt Jensen,et al.  Reduction of broad-band noise in speech by truncated QSVD , 1995, IEEE Trans. Speech Audio Process..

[10]  Kuldip K. Paliwal,et al.  The importance of phase in speech enhancement , 2011, Speech Commun..

[11]  Herman J. M. Steeneken,et al.  Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems , 1993, Speech Commun..

[12]  Xavier Serra,et al.  A Wavenet for Speech Denoising , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[13]  Jian Sun,et al.  Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[14]  Yu Tsao,et al.  End-to-End Waveform Utterance Enhancement for Direct Evaluation Metrics Optimization by Fully Convolutional Neural Networks , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[15]  Paris Smaragdis,et al.  Speech Enhancement by Online Non-negative Spectrogram Decomposition in Non-stationary Noise Environments , 2012, INTERSPEECH.

[16]  Jae Lim,et al.  Signal estimation from modified short-time Fourier transform , 1984 .

[17]  Qinghua Hu,et al.  Progressive Image Deraining Networks: A Better and Simpler Baseline , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  DeLiang Wang,et al.  A Convolutional Recurrent Neural Network for Real-Time Speech Enhancement , 2018, INTERSPEECH.

[19]  Ivan Marsic,et al.  RHR-Net: A Residual Hourglass Recurrent Neural Network for Speech Enhancement , 2019, ArXiv.

[20]  S. Boll,et al.  Suppression of acoustic noise in speech using spectral subtraction , 1979 .

[21]  Antonio Bonafonte,et al.  SEGAN: Speech Enhancement Generative Adversarial Network , 2017, INTERSPEECH.

[22]  Jonathan G. Fiscus,et al.  DARPA TIMIT:: acoustic-phonetic continuous speech corpus CD-ROM, NIST speech disc 1-1.1 , 1993 .

[23]  DeLiang Wang,et al.  On Training Targets for Supervised Speech Separation , 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[24]  Jonathan Le Roux,et al.  Explicit consistency constraints for STFT spectrograms and their application to phase reconstruction , 2008, SAPA@INTERSPEECH.

[25]  Jonah Casebeer,et al.  Adaptive Front-ends for End-to-end Source Separation , 2017 .

[26]  Christopher Joseph Pal,et al.  Delving Deeper into Convolutional Networks for Learning Video Representations , 2015, ICLR.

[27]  Philipos C. Loizou,et al.  Speech Enhancement: Theory and Practice , 2007 .

[28]  Yann Dauphin,et al.  Language Modeling with Gated Convolutional Networks , 2016, ICML.

[29]  Xiaodong Li,et al.  Convolutional Recurrent Neural Network Based Progressive Learning for Monaural Speech Enhancement , 2019, ArXiv.

[30]  Jesper Jensen,et al.  An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[31]  DeLiang Wang,et al.  A New Framework for CNN-Based Speech Enhancement in the Time Domain , 2019, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[32]  Roberto Cipolla,et al.  SegNet: A Deep Convolutional Encoder-Decoder Architecture for Robust Semantic Pixel-Wise Labelling , 2015, CVPR 2015.

[33]  DeLiang Wang,et al.  Complex Ratio Masking for Monaural Speech Separation , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[34]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[35]  DeLiang Wang,et al.  Supervised Speech Separation Based on Deep Learning: An Overview , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.