Speech Enhancement using Separable Polling Attention and Global Layer Normalization followed with PReLU

Single channel speech enhancement is a challenging task in speech community. Recently, various neural networks based methods have been applied to speech enhancement. Among these models, PHASEN and T-GSA achieve state-of-the-art performances on the publicly opened VoiceBank+DEMAND corpus. Both of the models reach the COVL score of 3.62. PHASEN achieves the highest CSIG score of 4.21 while TGSA gets the highest PESQ score of 3.06. However, both of these two models are very large. The contradiction between the model performance and the model size is hard to reconcile. In this paper, we introduce three kinds of techniques to shrink the PHASEN model and improve the performance. Firstly, seperable polling attention is proposed to replace the frequency transformation blocks in PHASEN. Secondly, global layer normalization followed with PReLU is used to replace batch normalization followed with ReLU. Finally, BLSTM in PHASEN is replaced with Conv2d operation and the phase stream is simplified. With all these modifications, the size of the PHASEN model is shrunk from 33M parameters to 5M parameters, while the performance on VoiceBank+DEMAND is improved to the CSIG score of 4.30, the PESQ score of 3.07 and the COVL score of 3.73.

[1]  Bin Yang,et al.  AeGAN: Time-Frequency Speech Denoising via Generative Adversarial Networks , 2019, 2020 28th European Signal Processing Conference (EUSIPCO).

[2]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[3]  Yu Tsao,et al.  Improving Perceptual Quality by Phone-Fortified Perceptual Loss for Speech Enhancement , 2020, ArXiv.

[4]  Tillman Weyde,et al.  Improved Speech Enhancement with the Wave-U-Net , 2018, ArXiv.

[5]  Nobutaka Ito,et al.  The Diverse Environments Multi-channel Acoustic Noise Database (DEMAND): A database of multichannel environmental noise recordings , 2013 .

[6]  Jung-Woo Ha,et al.  Multi-Domain Processing via Hybrid Denoising Networks for Speech Enhancement , 2018, ArXiv.

[7]  Simon King,et al.  The voice bank corpus: Design, collection and data analysis of a large regional accent speech database , 2013, 2013 International Conference Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE).

[8]  Hemant A. Patil,et al.  Time-Frequency Mask-based Speech Enhancement using Convolutional Generative Adversarial Network , 2018, 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC).

[9]  Antonio Bonafonte,et al.  SEGAN: Speech Enhancement Generative Adversarial Network , 2017, INTERSPEECH.

[10]  Deepak Baby,et al.  Sergan: Speech Enhancement Using Relativistic Generative Adversarial Networks with Gradient Penalty , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11]  Jungwon Lee,et al.  T-GSA: Transformer with Gaussian-Weighted Self-Attention for Speech Enhancement , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12]  François Chollet,et al.  Xception: Deep Learning with Depthwise Separable Convolutions , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Geoffrey E. Hinton,et al.  Rectified Linear Units Improve Restricted Boltzmann Machines , 2010, ICML.

[14]  Vladlen Koltun,et al.  Speech Denoising with Deep Feature Losses , 2018, INTERSPEECH.

[15]  Maarten De Vos,et al.  Improving GANs for Speech Enhancement , 2020, IEEE Signal Processing Letters.

[16]  Yan Li,et al.  NAAGN: Noise-Aware Attention-Gated Network for Speech Enhancement , 2020, INTERSPEECH.

[17]  Christophe Garcia,et al.  Simplifying ConvNets for Fast Learning , 2012, ICANN.

[18]  Jungwon Lee,et al.  End-to-End Multi-Task Denoising for joint SDR and PESQ Optimization , 2019, ArXiv.

[19]  Shou-De Lin,et al.  MetricGAN: Generative Adversarial Networks based Black-box Metric Scores Optimization for Speech Enhancement , 2019, ICML.

[20]  Geoffrey E. Hinton,et al.  Layer Normalization , 2016, ArXiv.

[21]  Jian Sun,et al.  Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[22]  Zhiwei Xiong,et al.  PHASEN: A Phase-and-Harmonics-Aware Speech Enhancement Network , 2019, AAAI.

[23]  Nima Mesgarani,et al.  Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation , 2018, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[24]  Junichi Yamagishi,et al.  Speech Enhancement for a Noise-Robust Text-to-Speech Synthesis System Using Deep Recurrent Neural Networks , 2016, INTERSPEECH.

[25]  Xavier Serra,et al.  A Wavenet for Speech Denoising , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[26]  Umut Isik,et al.  Attention Wave-U-Net for Speech Enhancement , 2019, 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA).

[27]  Klaus-Robert Müller,et al.  Efficient BackProp , 2012, Neural Networks: Tricks of the Trade.