TPARN: Triple-Path Attentive Recurrent Network for Time-Domain Multichannel Speech Enhancement

In this work, we propose a new model called triple-path attentive recurrent network (TPARN) for multichannel speech enhancement in the time domain. TPARN extends a single-channel dual-path network to a multichannel network by adding a third path along the spatial dimension. First, TPARN processes speech signals from all channels independently using a dual-path attentive recurrent network (ARN), which is a recurrent neural network (RNN) augmented with self-attention. Next, an ARN is introduced along the spatial dimension for spatial context aggregation. TPARN is designed as a multiple-input and multiple-output architecture to enhance all input channels simultaneously. Experimental results demonstrate the superiority of TPARN over existing state-of-the-art approaches.
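
The data flow described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation. The tensor layout `(channels, chunks, chunk_len, features)`, the function names, and the split of the dual path into an intra-chunk and an inter-chunk pass are assumptions made for illustration. The `self_attention` placeholder stands in for an ARN block and omits the RNN that an ARN contains.

```python
import numpy as np

def self_attention(x):
    """Placeholder sequence module (hypothetical stand-in for an ARN block).

    Single-head dot-product self-attention with a residual connection and no
    learned weights. x has shape (batch, seq_len, features).
    """
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return x + weights @ x

def apply_along(x, axis, module):
    """Run `module` along one axis of x; all other non-feature axes act as batch."""
    moved = np.moveaxis(x, axis, -2)                # (..., seq, features)
    flat = moved.reshape(-1, moved.shape[-2], moved.shape[-1])
    out = module(flat).reshape(moved.shape)
    return np.moveaxis(out, -2, axis)

def triple_path_block(x, module=self_attention):
    """x: (channels, chunks, chunk_len, features), an assumed layout."""
    x = apply_along(x, 2, module)  # path 1: within each chunk, per channel
    x = apply_along(x, 1, module)  # path 2: across chunks, per channel
    x = apply_along(x, 0, module)  # path 3: across channels (spatial)
    return x

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6, 8, 16))   # 4 microphones, 6 chunks of 8 frames
y = triple_path_block(x)
print(y.shape)  # one output per input channel: same shape as the input
```

In this sketch the first two passes treat the channel axis as a batch axis, so each channel is processed independently; only the third pass mixes information across microphones. The output keeps the channel axis, which mirrors the multiple-input and multiple-output design.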
