Paper information - A Cross-Channel Attention-Based Wave-U-Net for Multi-Channel Speech Enhancement


In this paper, we present a novel architecture for multi-channel speech enhancement using a cross-channel attention-based Wave-U-Net structure. Although using spatial information alongside spectral information is advantageous, it is challenging to train a multi-channel deep learning system effectively in an end-to-end framework. By combining a channel-independent encoding architecture for spectral estimation with an inter-channel attention mechanism that extracts spatial information, we implement a multi-channel speech enhancement system that performs well even in reverberant and extremely noisy environments. Experimental results show that the proposed architecture achieves superior performance in terms of signal-to-distortion ratio improvement (SDRi), short-time objective intelligibility (STOI), and phoneme error rate (PER) for speech recognition.
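The two ideas named in the abstract, channel-independent encoding and inter-channel attention, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single-layer convolutional encoder, the scaled dot-product form of the attention, the choice of microphone 0 as the reference, and all names (`encode_channel`, `cross_channel_attention`, `fuse_channels`) are assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def encode_channel(wave, kernel):
    """Hypothetical one-layer encoder applied to a single microphone signal.

    wave:   (T,) raw waveform of one channel
    kernel: (F, K) filter bank shared by every channel, so each channel is
            encoded independently but with the same weights
    returns (T - K + 1, F) feature frames
    """
    K = kernel.shape[1]
    n_frames = wave.shape[0] - K + 1
    idx = np.arange(n_frames)[:, None] + np.arange(K)[None, :]
    frames = wave[idx]                      # (n_frames, K) sliding windows
    return np.tanh(frames @ kernel.T)       # (n_frames, F)

def cross_channel_attention(ref_feat, other_feat):
    """Scaled dot-product attention across channels (assumed form).

    Queries come from the reference channel; keys and values come from
    another channel, so the weights express how frames of the two
    microphones relate to each other.
    """
    d = ref_feat.shape[-1]
    scores = ref_feat @ other_feat.T / np.sqrt(d)   # (n_frames, n_frames)
    weights = softmax(scores, axis=-1)
    return weights @ other_feat, weights

def fuse_channels(mics, kernel):
    """Encode every channel separately, then attend from channel 0 to the rest."""
    feats = [encode_channel(m, kernel) for m in mics]
    ref = feats[0]
    attended = [cross_channel_attention(ref, f)[0] for f in feats[1:]]
    return np.concatenate([ref] + attended, axis=-1)

rng = np.random.default_rng(0)
mics = rng.standard_normal((2, 64))     # 2 microphones, 64 samples each
kernel = rng.standard_normal((8, 5))    # 8 filters of length 5
fused = fuse_channels(mics, kernel)
print(fused.shape)
```

In a real system the encoder would be a deep stack of down-sampling convolutions and the attention weights would be learned; the sketch only shows how per-channel features can be computed with shared weights and then related across microphones.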

Dong Hoon Yi | Jinyoung Lee | Hong-Goo Kang | Minh Tri Ho | Bong-Ki Lee
