Paper information - Speech Denoising in the Waveform Domain With Self-Attention

In this work, we present CleanUNet, a causal speech denoising model that operates on the raw waveform. The proposed model is based on an encoder-decoder architecture combined with several self-attention blocks that refine its bottleneck representations, which is crucial for obtaining good results. The model is optimized through a set of losses defined over both the waveform and multi-resolution spectrograms. The proposed method outperforms state-of-the-art models in denoised speech quality across various objective and subjective evaluation metrics.

Bryan Catanzaro | Wei Ping | Zhifeng Kong | Ambrish Dantrey

[1] Yu Tsao, et al. MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement, 2021, INTERSPEECH.

[2] Luc Soler, et al. U-Net Transformer: Self and Cross Attention for Medical Image Segmentation, 2021, MLMI@MICCAI.

[3] Xiang Hao, et al. FullSubNet: A Full-Band and Sub-Band Fusion Model for Real-Time Single-Channel Speech Enhancement, 2020, 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[4] Jean-Marc Valin, et al. PoCoNet: Better Speech Enhancement with Frequency-Positional Embeddings, Semi-Supervised Conversational Data, and Biased Loss, 2020, INTERSPEECH.

[5] Gabriel Synnaeve, et al. Real Time Speech Enhancement in the Waveform Domain, 2020, INTERSPEECH.

[6] Johannes Gehrke, et al. The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Testing Framework, and Challenge Results, 2020, INTERSPEECH.

[7] Nils L. Westhausen, et al. Dual-Signal Transformation LSTM Network for Real-Time Noise Suppression, 2020, INTERSPEECH.

[8] Maarten De Vos, et al. Improving GANs for Speech Enhancement, 2020, IEEE Signal Processing Letters.

[9] Ryuichi Yamamoto, et al. Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram, 2019, 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10] Hui Zhang, et al. UNetGAN: A Robust Speech Enhancement Approach in Time Domain for Extremely Low Signal-to-Noise Ratio Condition, 2019, INTERSPEECH.

[11] Kuldip K. Paliwal, et al. Deep learning for minimum mean-square error approaches to speech enhancement, 2019, Speech Commun.

[12] Shou-De Lin, et al. MetricGAN: Generative Adversarial Networks based Black-box Metric Scores Optimization for Speech Enhancement, 2019, ICML.

[13] DeLiang Wang, et al. TCNN: Temporal Convolutional Neural Network for Real-time Speech Enhancement in the Time Domain, 2019, 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[14] Deepak Baby, et al. SERGAN: Speech Enhancement Using Relativistic Generative Adversarial Networks with Gradient Penalty, 2019, 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15] Nima Mesgarani, et al. Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation, 2018, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[16] Wei Ping, et al. ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech, 2018, ICLR.

[17] Vladlen Koltun, et al. Speech Denoising with Deep Feature Losses, 2018, INTERSPEECH.

[18] Simon Dixon, et al. Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation, 2018, ISMIR.

[19] Hemant A. Patil, et al. Time-Frequency Masking-Based Speech Enhancement Using Generative Adversarial Network, 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[20] Xavier Serra, et al. A Wavenet for Speech Denoising, 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[21] Cassia Valentini-Botinhao, et al. Noisy speech database for training speech enhancement algorithms and TTS models, 2017.

[22] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[23] Antonio Bonafonte, et al. SEGAN: Speech Enhancement Generative Adversarial Network, 2017, INTERSPEECH.

[24] Heiga Zen, et al. WaveNet: A Generative Model for Raw Audio, 2016, SSW.

[25] DeLiang Wang, et al. Complex Ratio Masking for Monaural Speech Separation, 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[26] Jun Du, et al. Multi-objective learning and mask-based post-processing for deep neural network based speech enhancement, 2017, INTERSPEECH.

[27] Björn W. Schuller, et al. Speech Enhancement with LSTM Recurrent Neural Networks and its Application to Noise-Robust ASR, 2015, LVA/ICA.

[28] Thomas Brox, et al. U-Net: Convolutional Networks for Biomedical Image Segmentation, 2015, MICCAI.

[29] DeLiang Wang, et al. A deep neural network for time-domain signal reconstruction, 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[30] Li-Rong Dai, et al. A Regression Approach to Speech Enhancement Based on Deep Neural Networks, 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[31] Yu Tsao, et al. Speech enhancement based on deep denoising autoencoder, 2013, INTERSPEECH.

[32] Jesper Jensen, et al. An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech, 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[33] Yi Hu, et al. Evaluation of Objective Quality Measures for Speech Enhancement, 2008, IEEE Transactions on Audio, Speech, and Language Processing.

[34] Philipos C. Loizou, et al. Speech Enhancement: Theory and Practice, 2007.

[35] Phil D. Green, et al. Speech enhancement with missing data techniques using recurrent neural networks, 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[36] Methods for objective and subjective assessment of quality. Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs, 2002.

[37] Alex Waibel, et al. Noise reduction using connectionist models, 1988, ICASSP-88, International Conference on Acoustics, Speech, and Signal Processing.

[38] Julius O. Smith, et al. A flexible sampling-rate conversion method, 1984, ICASSP.

[39] A.V. Oppenheim, et al. Enhancement and bandwidth compression of noisy speech, 1979, Proceedings of the IEEE.

[40] S. Boll, et al. Suppression of acoustic noise in speech using spectral subtraction, 1979.