论文信息 - Multi-band PIT and Model Integration for Improved Multi-channel Speech Separation

Multi-band PIT and Model Integration for Improved Multi-channel Speech Separation

The recent exploration of deep learning for supervised speech separation has significantly accelerated the progress on the multi-talker speech separation problem. Multi-channel extension has attracted much research attention due to the benefit of spatial information in far-field acoustic environments. In this paper, We review the most recent models of multi-channel permutation invariant training (PIT), investigate spatial features formed by microphone pairs and their underlying impact and issue, present a multi-band architecture for effective feature encoding, and conduct a model integration between single-channel and multi-channel PIT for resolving the spatial overlapping problem in the conventional multi-channel PIT framework. The evaluation confirms the significant improvement achieved with the proposed model and training approach for the multi-channel speech separation.

[1] Hakan Erdogan,et al. Multi-Microphone Neural Speech Separation for Far-Field Multi-Talker Speech Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2] Dong Yu,et al. Past review, current progress, and challenges ahead on the cocktail party problem , 2018, Frontiers of Information Technology & Electronic Engineering.

[3] E. C. Cmm,et al. on the Recognition of Speech, with , 2008 .

[4] DeLiang Wang,et al. Speech segregation based on sound localization , 2001, IJCNN'01. International Joint Conference on Neural Networks. Proceedings (Cat. No.01CH37222).

[5] Nima Mesgarani,et al. Speaker-Independent Speech Separation With Deep Attractor Network , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[6] Jonathan Le Roux,et al. Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[7] Zhong-Qiu Wang,et al. Integrating Spectral and Spatial Features for Multi-Channel Speaker Separation , 2018, INTERSPEECH.

[8] Scott Rickard,et al. Blind separation of speech mixtures via time-frequency masking , 2004, IEEE Transactions on Signal Processing.

[9] Hiroshi Sawada,et al. A Two-Stage Frequency-Domain Blind Source Separation Method for Underdetermined Convolutive Mixtures , 2007, 2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics.

[10] Xiong Xiao,et al. Efficient Integration of Fixed Beamformers and Speech Separation Networks for Multi-Channel Far-Field Speech Separation , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11] Jesper Jensen,et al. Permutation invariant training of deep models for speaker-independent multi-talker speech separation , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12] Özgür Yilmaz,et al. Blind separation of disjoint orthogonal signals: demixing N sources from 2 mixtures , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[13] Zhuo Chen,et al. Deep clustering: Discriminative embeddings for segmentation and separation , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[14] Nima Mesgarani,et al. Deep attractor network for single-microphone speaker separation , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15] Chris J. Bleakley,et al. DOA estimation of multiple sparse sources using three widely-spaced sensors , 2009, 2009 17th European Signal Processing Conference.

[16] Zhong-Qiu Wang,et al. Multi-Channel Deep Clustering: Discriminative Spectral and Spatial Embeddings for Speaker-Independent Speech Separation , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[17] Jonathan Le Roux,et al. Single-Channel Multi-Speaker Separation Using Deep Clustering , 2016, INTERSPEECH.

[18] Rémi Gribonval,et al. Performance measurement in blind audio source separation , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[19] Liang Lu,et al. Deep beamforming networks for multi-channel speech recognition , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[20] Jont B. Allen,et al. Image method for efficiently simulating small‐room acoustics , 1976 .