Translate Reverberated Speech to Anechoic Ones: Speech Dereverberation with BERT

Single channel speech dereverberation is considered in this work. Inspired by the recent success of Bidirectional Encoder Representations from Transformers (BERT) model in the domain of Natural Language Processing (NLP), we investigate its applicability as backbone sequence model to enhance reverberated speech signal. We present a variation of the basic BERT model: a pre-sequence network, which extracts local spectral-temporal information and/or provides order information, before the backbone sequence model. In addition, we use pre-trained neural vocoder for implicit phase reconstruction. To evaluate our method, we used the data from the 3rd CHiME challenge, and compare our results with other methods. Experiments show that the proposed method outperforms traditional method WPE, and achieve comparable performance with state-of-the-art BLSTM-based sequence models.

[1]  Yoshua Bengio,et al.  Attention-Based Models for Speech Recognition , 2015, NIPS.

[2]  Jianwu Dang,et al.  Environment-Dependent Attention-Driven Recurrent Convolutional Neural Network for Robust Speech Enhancement , 2019, INTERSPEECH.

[3]  Reinhold Haeb-Umbach,et al.  NARA-WPE: A Python package for weighted prediction error dereverberation in Numpy and Tensorflow for online and offline processing , 2018, ITG Symposium on Speech Communication.

[4]  E. Lehmann,et al.  Prediction of energy decay in room impulse responses simulated with an image-source model. , 2008, The Journal of the Acoustical Society of America.

[5]  Ryuichi Yamamoto,et al.  Parallel Wavegan: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[6]  Jingbo Zhu,et al.  Learning Deep Transformer Models for Machine Translation , 2019, ACL.

[7]  Shang-Wen Li,et al.  Audio Albert: A Lite Bert for Self-Supervised Learning of Audio Representation , 2020, 2021 IEEE Spoken Language Technology Workshop (SLT).

[8]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[9]  Antonio Bonafonte,et al.  SEGAN: Speech Enhancement Generative Adversarial Network , 2017, INTERSPEECH.

[10]  Sharon Gannot,et al.  Speech Dereverberation Using Fully Convolutional Networks , 2018, 2018 26th European Signal Processing Conference (EUSIPCO).

[11]  DeLiang Wang,et al.  Complex Ratio Masking for Monaural Speech Separation , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[12]  Jungwon Lee,et al.  T-GSA: Transformer with Gaussian-Weighted Self-Attention for Speech Enhancement , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[13]  Hung-yi Lee,et al.  Understanding Self-Attention of Self-Supervised Audio Transformers , 2020, INTERSPEECH.

[14]  Matthias Sperber,et al.  Self-Attentional Acoustic Models , 2018, INTERSPEECH.

[15]  DeLiang Wang,et al.  A Convolutional Recurrent Neural Network for Real-Time Speech Enhancement , 2018, INTERSPEECH.

[16]  Marc Delcroix,et al.  Speech Enhancement Using Self-Adaptation and Multi-Head Self-Attention , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[17]  Yong Xu,et al.  An Attention-based Neural Network Approach for Single Channel Speech Enhancement , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[18]  Hung-yi Lee,et al.  Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[19]  Chin-Hui Lee,et al.  Convolutional-Recurrent Neural Networks for Speech Enhancement , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).