DF-Conformer: Integrated Architecture of Conv-TasNet and Conformer Using Linear Complexity Self-Attention for Speech Enhancement

Single-channel speech enhancement (SE) is an important task in speech processing. A widely used framework combines an analysis/synthesis filterbank with a mask prediction network, as in the Conv-TasNet architecture. In such systems, the denoising performance and computational efficiency are determined mainly by the structure of the mask prediction network. In this study, we aim to improve the sequential modeling ability of Conv-TasNet architectures by integrating Conformer layers into a new mask prediction network. To make the model computationally feasible, we extend the Conformer with linear-complexity self-attention and stacked 1-D dilated depthwise convolution layers. We train the model on 3,396 hours of noisy speech data and show that (i) linear-complexity self-attention avoids the quadratic cost of standard attention, and (ii) our model achieves a higher scale-invariant signal-to-noise ratio (SI-SNR) than the improved time-dilated convolution network (TDCN++), an extended version of Conv-TasNet.
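As a rough illustration of why linear-complexity self-attention matters here, the sketch below contrasts standard softmax attention, whose cost grows quadratically with the number of frames, with a kernelized linear-attention variant in the style of "Transformers are RNNs" (Katharopoulos et al., 2020). This is a minimal NumPy sketch, not the paper's implementation; the feature map `phi` and all function names are illustrative assumptions.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard scaled dot-product attention: O(T^2) time/memory in T frames."""
    scores = (Q @ K.T) / np.sqrt(Q.shape[-1])     # (T, T) pairwise scores
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # (T, d_v)

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Kernelized attention: O(T) in T frames.

    Replacing exp(q . k) with phi(q) . phi(k) lets the key/value product be
    precomputed as a (d, d_v) summary, so no T x T matrix is ever formed.
    The feature map phi here (ReLU plus epsilon) is an illustrative choice.
    """
    Qp, Kp = phi(Q), phi(K)                       # (T, d) positive features
    KV = Kp.T @ V                                 # (d, d_v) key/value summary
    Z = Qp @ Kp.sum(axis=0)[:, None]              # (T, 1) normalizer
    return (Qp @ KV) / Z

rng = np.random.default_rng(0)
T, d = 512, 64                                    # frames x head dimension
Q, K, V = rng.standard_normal((3, T, d))
out_quadratic = softmax_attention(Q, K, V)        # exact softmax attention
out_linear = linear_attention(Q, K, V)            # linear-cost approximation
print(out_quadratic.shape, out_linear.shape)      # (512, 64) (512, 64)
```

For a T-frame input, the linear variant replaces the T-by-T attention matrix with a d-by-d_v summary of the keys and values, so time and memory scale linearly in T. In the same spirit, stacking 1-D depthwise convolutions with dilations 1, 2, 4, ... grows the receptive field exponentially with depth at a fixed parameter cost.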
