NAAGN: Noise-Aware Attention-Gated Network for Speech Enhancement

For single-channel speech enhancement, contextual information is crucial for accurate speech estimation. In this paper, to capture long-term temporal contexts, we treat speech enhancement as a sequence-to-sequence mapping problem and propose a noise-aware attention-gated network (NAAGN) for speech enhancement. First, by incorporating deep residual learning and dilated convolutions into the U-Net architecture, we present a deep residual U-Net (ResUNet), which significantly expands the receptive field and aggregates contextual information systematically. Second, attention-gated (AG) units are integrated into the ResUNet architecture with minimal computational overhead, further increasing sensitivity to long-term contexts and improving prediction accuracy. Third, we propose a novel noise-aware multi-task loss function, the weighted mean absolute error (WMAE) loss, which takes both the speech estimation loss and the noise prediction loss into account. Finally, the proposed NAAGN model was evaluated on the Voice Bank corpus and the DEMAND database, which have been widely used to benchmark deep learning based speech enhancement methods. Experimental results indicate that NAAGN achieves larger segmental SNR improvements, better speech quality, and higher speech intelligibility than the reference methods.
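
To make the ResUNet component concrete, here is a minimal sketch, assuming PyTorch, of a residual block with dilated convolutions such as could be stacked inside a U-Net encoder or decoder. The module name `DilatedResBlock`, the block layout, and the channel handling are illustrative assumptions, not the authors' implementation; the abstract states only that residual learning and dilated convolutions are combined to enlarge the receptive field.

```python
import torch
import torch.nn as nn

class DilatedResBlock(nn.Module):
    """Illustrative residual block with dilated 3x3 convolutions
    (a sketch; the exact layout in NAAGN may differ)."""
    def __init__(self, channels: int, dilation: int):
        super().__init__()
        pad = dilation  # 'same' padding for a 3x3 kernel at this dilation
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=pad, dilation=dilation),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=pad, dilation=dilation),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Identity shortcut: the block learns a residual correction,
        # so stacking blocks widens the temporal context window
        # without shrinking the feature maps.
        return self.act(x + self.body(x))
```

Stacking such blocks with growing dilation rates (e.g., 1, 2, 4, ...) is the usual way to obtain a large receptive field at constant resolution.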
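The attention-gated skip connections can be realized with an additive attention gate in the spirit of Attention U-Net; the sketch below is a hedged illustration under that assumption, with `x` the encoder skip features and `g` the decoder gating signal (assumed already resized to `x`'s spatial resolution). The intermediate width `inter_ch` and all names are assumptions.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Additive attention gate for U-Net skip connections (sketch)."""
    def __init__(self, x_ch: int, g_ch: int, inter_ch: int):
        super().__init__()
        # 1x1 convolutions keep the added computational cost small,
        # consistent with the "minimal overhead" claim.
        self.theta_x = nn.Conv2d(x_ch, inter_ch, kernel_size=1)
        self.phi_g = nn.Conv2d(g_ch, inter_ch, kernel_size=1)
        self.psi = nn.Conv2d(inter_ch, 1, kernel_size=1)

    def forward(self, x: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
        # Attention coefficients in [0, 1] per time-frequency position.
        attn = torch.sigmoid(self.psi(torch.relu(self.theta_x(x) + self.phi_g(g))))
        # Scale the skip features so less relevant regions are
        # suppressed before they reach the decoder.
        return x * attn
```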
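For the WMAE loss, the abstract states only that a speech estimation term and a noise prediction term are combined. One plausible reading, assuming the network has a second output head that predicts the noise, is the convex combination sketched below; the weight `alpha`, the function name, and this exact formulation are assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def wmae_loss(est_speech: torch.Tensor, est_noise: torch.Tensor,
              clean_speech: torch.Tensor, noisy_mixture: torch.Tensor,
              alpha: float = 0.5) -> torch.Tensor:
    """Sketch of a noise-aware weighted MAE multi-task loss.

    Assumes the model outputs both an estimated speech signal and an
    estimated noise signal; the noise target is recovered as mixture
    minus clean speech. `alpha` balances the two tasks (assumed value).
    """
    true_noise = noisy_mixture - clean_speech
    speech_term = F.l1_loss(est_speech, clean_speech)  # speech estimation loss
    noise_term = F.l1_loss(est_noise, true_noise)      # noise prediction loss
    return alpha * speech_term + (1.0 - alpha) * noise_term
```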
