Lightweight Causal Transformer with Local Self-Attention for Real-Time Speech Enhancement

In this paper, we describe a novel transformer architecture for speech enhancement. The model uses local causal self-attention, which keeps it lightweight and therefore particularly well suited for real-time speech enhancement in environments with limited computational resources. In addition, we present several ablation studies covering different parts of the model and the loss function to determine which modifications yield the largest improvements. Based on these findings, we propose a final version of our architecture, which we submitted to the INTERSPEECH 2021 DNS Challenge, where it achieved competitive results despite using only 2% of the maximally allowed computation. Furthermore, we compared it with LSTM and CNN models that have 127% and 257% more parameters, respectively. Despite this difference in model size, our model achieves significant improvements on the considered speech quality and intelligibility measures.
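The core idea named in the abstract — local causal self-attention — restricts each frame to attend only to itself and a fixed window of past frames, which is what bounds the per-frame computation. The following is a minimal NumPy sketch of that masking scheme, not the paper's implementation: the function names and the single-head, unbatched formulation are illustrative assumptions.

```python
import numpy as np


def local_causal_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where position i may attend to j iff i - window < j <= i."""
    idx = np.arange(seq_len)
    causal = idx[None, :] <= idx[:, None]          # no attention to future frames
    local = idx[:, None] - idx[None, :] < window   # limited look-back span
    return causal & local


def local_causal_attention(q: np.ndarray, k: np.ndarray,
                           v: np.ndarray, window: int) -> np.ndarray:
    """Single-head scaled dot-product attention with a local causal mask (illustrative)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    mask = local_causal_mask(q.shape[0], window)
    scores = np.where(mask, scores, -np.inf)       # disallowed positions get zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because every row of the mask contains at most `window` true entries, the effective cost per frame is O(window · d) rather than growing with the full sequence length, which is what makes this attention variant attractive for streaming, resource-limited inference.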
