Residual Recurrent Neural Network for Speech Enhancement

Most current speech enhancement models use spectrogram features that require an expensive transformation and result in phase information loss. Previous work has overcome these issues by using convolutional networks to learn temporal correlations across high-resolution waveforms. These models, however, are limited by memory-intensive dilated convolutions and aliasing artifacts from upsampling. We introduce an end-to-end fully recurrent neural network for single-channel speech enhancement. The network is structured as an hourglass that can efficiently capture long-range temporal dependencies by reducing the feature resolution without information loss. We also use residual connections to prevent gradient decay across layers and to improve model generalization. Experimental results show that our model outperforms state-of-the-art approaches on six quantitative evaluation metrics.
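The abstract describes the architecture but the section contains no code, so the following is a minimal PyTorch sketch of one plausible reading of the hourglass recurrent design. The class name `HourglassRNN`, the choice of GRU cells, the channel width and depth, and the reshape-based resampling are all illustrative assumptions, not the authors' actual configuration.

```python
# Hypothetical sketch of an hourglass-shaped recurrent enhancement network.
# All hyperparameters and layer choices here are assumptions for illustration.
import torch
import torch.nn as nn

class HourglassRNN(nn.Module):
    """Illustrative hourglass recurrent network for waveform enhancement.

    The encoder halves the time resolution at each level by folding pairs of
    adjacent steps into the channel dimension (a lossless reshape, unlike
    pooling); the decoder unfolds them back, with residual connections
    joining encoder and decoder features at matching resolutions.
    """

    def __init__(self, channels=64, depth=3):
        super().__init__()
        self.inp = nn.Linear(1, channels)
        self.down = nn.ModuleList(
            [nn.GRU(2 * channels, channels, batch_first=True) for _ in range(depth)])
        self.up = nn.ModuleList(
            [nn.GRU(channels, 2 * channels, batch_first=True) for _ in range(depth)])
        self.out = nn.Linear(channels, 1)

    def forward(self, x):                    # x: (B, T) waveform, T % 2**depth == 0
        h = self.inp(x.unsqueeze(-1))        # (B, T, C)
        skips = []
        for rnn in self.down:
            skips.append(h)                  # keep features at this resolution
            b, t, c = h.shape
            h = h.reshape(b, t // 2, 2 * c)  # fold step pairs: lossless halving
            h, _ = rnn(h)                    # (B, t/2, C)
        for rnn, skip in zip(self.up, reversed(skips)):
            h, _ = rnn(h)                    # (B, t, 2C)
            b, t, c = h.shape
            h = h.reshape(b, 2 * t, c // 2)  # unfold: restore double resolution
            h = h + skip                     # residual connection across levels
        return self.out(h).squeeze(-1)       # (B, T) enhanced waveform

net = HourglassRNN()
noisy = torch.randn(4, 16384)  # four 1 s clips at 16 kHz
enhanced = net(noisy)          # (4, 16384), same resolution as the input
```

The reshape-based fold is one way to realize "reducing the feature resolution without information loss": every sample is retained in the channel dimension, whereas pooling or strided convolution would discard information or introduce aliasing.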
