Gated Recurrent Fusion of Spatial and Spectral Features for Multi-Channel Speech Separation with Deep Embedding Representations

Multi-channel deep clustering (MDC) has achieved good performance for speech separation. However, MDC applies spatial features only as additional information and does not fuse them well with the spectral features, making it difficult to learn the mutual relationship between the two. Moreover, the training objective of MDC is defined on the embedding vectors rather than on the real separated sources, which may degrade separation performance. In this work, we treat spatial and spectral features as two different modalities. We propose the gated recurrent fusion (GRF) method, which uses gate and memory modules to adaptively select and fuse the relevant information from the spectral and spatial features. In addition, to address the training-objective problem of MDC, the real separated sources are used as the training targets. Specifically, we apply a deep clustering network to extract deep embedding features. Instead of using unsupervised K-means clustering to estimate binary masks, a second, supervised network learns soft masks from these deep embedding features. Our experiments are conducted on a spatialized reverberant version of the WSJ0-2mix dataset. Experimental results show that the proposed method outperforms the MDC baseline and even surpasses the oracle ideal binary mask (IBM).
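The abstract describes the GRF module only at a high level (gate and memory modules that select and fuse spectral and spatial features). As a rough illustration, the sketch below shows one plausible GRU-style gated fusion of per-frame spectral and spatial feature vectors in PyTorch; the class name, gating equations, and dimensions are assumptions for illustration, not the authors' actual implementation.

```python
# Hypothetical sketch of GRU-style gated fusion of two feature streams.
# Names, gating equations, and shapes are assumptions, not the paper's code.
import torch
import torch.nn as nn

class GatedRecurrentFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.reset = nn.Linear(2 * dim, dim)      # reset gate over both modalities
        self.update = nn.Linear(2 * dim, dim)     # update gate balancing the two streams
        self.candidate = nn.Linear(2 * dim, dim)  # candidate fused representation

    def forward(self, spectral: torch.Tensor, spatial: torch.Tensor) -> torch.Tensor:
        # spectral, spatial: (batch, frames, dim) per-frame features
        both = torch.cat([spectral, spatial], dim=-1)
        r = torch.sigmoid(self.reset(both))    # suppress irrelevant spatial content
        z = torch.sigmoid(self.update(both))   # per-dimension mixing weights
        h = torch.tanh(self.candidate(torch.cat([spectral, r * spatial], dim=-1)))
        # memory-style interpolation between the spectral stream and the candidate
        return z * spectral + (1.0 - z) * h

# Example usage with made-up shapes: 129 frequency bins, 100 frames.
fusion = GatedRecurrentFusion(dim=129)
spec = torch.randn(4, 100, 129)  # e.g. log-magnitude STFT features
spat = torch.randn(4, 100, 129)  # e.g. inter-channel phase-difference features
fused = fusion(spec, spat)       # -> (4, 100, 129)
```

In the full system described above, such fused features would feed the deep clustering network, and a separate supervised network would then map the resulting embeddings to soft masks in place of K-means clustering.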
