论文信息 - Multi-Modal Multi-Channel Target Speech Separation

Multi-Modal Multi-Channel Target Speech Separation

Target speech separation refers to extracting a target speaker's voice from an overlapped audio of simultaneous talkers. Previously the use of visual modality for target speech separation has demonstrated great potentials. This work proposes a general multi-modal framework for target speech separation by utilizing all the available information of the target speaker, including his/her spatial location, voice characteristics and lip movements. Also, under this framework, we investigate on the fusion methods for multi-modal joint modeling. A factorized attention-based fusion method is proposed to aggregate the high-level semantic information of multi-modalities at embedding level. This method firstly factorizes the mixture audio into a set of acoustic subspaces, then leverages the target's information from other modalities to enhance these subspace acoustic embeddings with a learnable attention scheme. To validate the robustness of proposed multi-modal separation model in practical scenarios, the system was evaluated under the condition that one of the modalities is temporarily missing, invalid or corrupted. Experiments are conducted on a large-scale audio-visual dataset collected from YouTube (to be released) that spatialized by simulated room impulse responses (RIRs). Experiment results illustrate that our proposed multi-modal framework significantly outperforms single-modal and bi-modal speech separation approaches, while can still support real-time processing.

[1] Kevin Wilson,et al. Looking to listen at the cocktail party , 2018, ACM Trans. Graph..

[2] James Philbin,et al. FaceNet: A unified embedding for face recognition and clustering , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3] Lei Xie,et al. Time Domain Audio Visual Speech Separation , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[4] Sergey Ioffe,et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[5] Shmuel Peleg,et al. Seeing Through Noise: Visually Driven Speaker Separation And Enhancement , 2017, ICASSP.

[6] Joon Son Chung,et al. My lips are concealed: Audio-visual speech enhancement through obstructions , 2019, INTERSPEECH.

[7] Colin Cherry,et al. Contribution to a Study of the “Cocktail Party Problem” , 1960 .

[8] Joon Son Chung,et al. Lip Reading Sentences in the Wild , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9] Joon Son Chung,et al. The Conversation: Deep Audio-Visual Speech Enhancement , 2018, INTERSPEECH.

[10] Xiong Xiao,et al. Efficient Integration of Fixed Beamformers and Speech Separation Networks for Multi-Channel Far-Field Speech Separation , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11] Tomohiro Nakatani,et al. Context adaptive deep neural networks for fast acoustic model adaptation , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12] Rémi Gribonval,et al. Performance measurement in blind audio source separation , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[13] E. Lehmann,et al. Prediction of energy decay in room impulse responses simulated with an image-source model. , 2008, The Journal of the Acoustical Society of America.

[14] Soumitro Chakrabarty,et al. Multi-Speaker DOA Estimation Using Deep Convolutional Networks Trained With Noise Signals , 2018, IEEE Journal of Selected Topics in Signal Processing.

[15] Tomohiro Nakatani,et al. Speaker-Aware Neural Network Based Beamformer for Speaker Extraction in Speech Mixtures , 2017, INTERSPEECH.

[16] Tomohiro Nakatani,et al. Learning speaker representation for neural network based multichannel speaker extraction , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[17] John H. L. Hansen,et al. Text-Independent Speaker Verification Based on Triplet Convolutional Neural Network Embeddings , 2018, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[18] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19] Erik Cambria,et al. Multimodal Sentiment Analysis using Hierarchical Fusion with Context Modeling , 2018, Knowl. Based Syst..

[20] John R. Hershey,et al. VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking , 2018, INTERSPEECH.

[21] Jonathan Le Roux,et al. SDR – Half-baked or Well Done? , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[22] Zhiyao Duan,et al. Audio–Visual Deep Clustering for Speech Separation , 2019, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[23] Yong Xu,et al. A comprehensive study of speech separation: spectrogram vs waveform separation , 2019, INTERSPEECH.

[24] Joon Son Chung,et al. Lip Reading in the Wild , 2016, ACCV.

[25] Dong Yu,et al. FACTORIZED DEEP NEURAL NETWORKS FOR ADAPTIVE SPEECH RECOGNITION , 2012 .

[26] Tomohiro Nakatani,et al. SpeakerBeam: Speaker Aware Neural Network for Target Speaker Extraction in Speech Mixtures , 2019, IEEE Journal of Selected Topics in Signal Processing.

[27] Olivier Galibert,et al. Methodologies for the evaluation of speaker diarization and automatic speech recognition in the presence of overlapping speech , 2013, INTERSPEECH.

[28] Zhong-Qiu Wang,et al. Multi-Channel Deep Clustering: Discriminative Spectral and Spatial Embeddings for Speaker-Independent Speech Separation , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[29] Jun Du,et al. Speech separation of a target speaker based on deep neural networks , 2014, 2014 12th International Conference on Signal Processing (ICSP).

[30] Andrew Owens,et al. Audio-Visual Scene Analysis with Self-Supervised Multisensory Features , 2018, ECCV.

[31] Yifan Gong,et al. End-to-End attention based text-dependent speaker verification , 2016, 2016 IEEE Spoken Language Technology Workshop (SLT).

[32] Zhuo Chen,et al. Single-channel Speech Extraction Using Speaker Inventory and Attention Network , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[33] DeLiang Wang,et al. Supervised Speech Separation Based on Deep Learning: An Overview , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[34] Nima Mesgarani,et al. TasNet: Surpassing Ideal Time-Frequency Masking for Speech Separation. , 2018 .

[35] Jun Wang,et al. Deep Extractor Network for Target Speaker Recovery From Single Channel Speech Mixtures , 2018, INTERSPEECH.

[36] Jesper Jensen,et al. Permutation invariant training of deep models for speaker-independent multi-talker speech separation , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[37] Xiong Xiao,et al. Multi-Channel Overlapped Speech Recognition with Location Guided Speech Extraction Network , 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[38] DeLiang Wang,et al. On Training Targets for Supervised Speech Separation , 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[39] Zhuo Chen,et al. Deep clustering: Discriminative embeddings for segmentation and separation , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[40] Nima Mesgarani,et al. Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation , 2018, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[41] Dong Yu,et al. Multi-band PIT and Model Integration for Improved Multi-channel Speech Separation , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[42] Jonathan Le Roux,et al. Phase Reconstruction with Learned Time-Frequency Representations for Single-Channel Speech Separation , 2018, 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC).

[43] Deliang Wang,et al. On Spatial Features for Supervised Speech Separation and its Application to Beamforming and Robust ASR , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[44] Joon Son Chung,et al. Deep Audio-Visual Speech Recognition , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[45] Davis E. King,et al. Dlib-ml: A Machine Learning Toolkit , 2009, J. Mach. Learn. Res..

[46] Dong Yu,et al. Neural Spatial Filter: Target Speaker Speech Separation Assisted with Directional Information , 2019, INTERSPEECH.

[47] DeLiang Wang,et al. Combining Spectral and Spatial Features for Deep Learning Based Blind Speaker Separation , 2019, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[48] Shmuel Peleg,et al. Improved Speech Reconstruction from Silent Video , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[49] Zhong-Qiu Wang,et al. End-to-End Speech Separation with Unfolded Iterative Phase Reconstruction , 2018, INTERSPEECH.

[50] Zhong-Qiu Wang,et al. Alternative Objective Functions for Deep Clustering , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).