Local spectral attention for full-band speech enhancement

The attention mechanism has been widely used in speech enhancement (SE) because, in principle, it can effectively model the inherent connections of a signal in both the time and frequency domains. Typically, the span of attention is limited in the time domain, whereas attention in the frequency domain spans the whole frequency range. In this paper, we observe that attention over the whole frequency range hampers inference for full-band SE and can lead to excessive residual noise. To alleviate this problem, we introduce local spectral attention (LSA) into a full-band SE model by limiting the span of attention along the frequency axis. An ablation study on a state-of-the-art (SOTA) full-band SE model shows that local frequency attention effectively improves overall performance. The improved model achieves the best objective scores on the full-band VoiceBank+DEMAND test set.
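The core idea of limiting the attention span along the frequency axis can be sketched with a banded attention mask: each frequency bin attends only to bins within a fixed neighborhood, and all other score entries are suppressed before the softmax. The sketch below is a minimal single-head illustration in NumPy under our own assumptions (the function name, the `span` parameter, and the hard band mask are hypothetical; the paper's exact LSA formulation may differ, e.g. in how the window is parameterized or learned).

```python
import numpy as np

def local_spectral_attention(Q, K, V, span=8):
    """Single-head attention over frequency bins, restricted to a local band.

    Q, K, V: arrays of shape (F, d), where F is the number of frequency bins.
    span: each bin attends only to bins j with |i - j| <= span
          (hypothetical parameter; illustrates the idea of limiting the span).
    """
    F, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)  # (F, F) full-band attention scores
    # Band mask: keep entries within +/- span bins, suppress the rest.
    idx = np.arange(F)
    mask = np.abs(idx[:, None] - idx[None, :]) <= span
    scores = np.where(mask, scores, -np.inf)
    # Numerically stable softmax over each row; masked entries become 0.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy usage: 32 frequency bins, 16-dimensional features per bin.
rng = np.random.default_rng(0)
F, d = 32, 16
Q = rng.standard_normal((F, d))
K = rng.standard_normal((F, d))
V = rng.standard_normal((F, d))
out = local_spectral_attention(Q, K, V, span=4)
```

Because the diagonal is always inside the band, every row of the mask has at least one unmasked entry, so the softmax is well defined; perturbing a distant bin's value leaves a given bin's output unchanged, which is exactly the locality the abstract argues for.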
