Paper information - Multi-Channel Deep Clustering: Discriminative Spectral and Spatial Embeddings for Speaker-Independent Speech Separation

The recently proposed deep clustering algorithm represents a fundamental advance towards solving the cocktail party problem in the single-channel case. When multiple microphones are available, spatial information can be leveraged to differentiate signals from different directions. This study combines spectral and spatial features in a deep clustering framework so that the complementary spectral and spatial information can be simultaneously exploited to improve speech separation. We find that simply encoding inter-microphone phase patterns as additional input features during deep clustering provides a significant improvement in separation performance, even with random microphone array geometry. Experiments on a spatialized version of the wsj0-2mix dataset show the strong potential of the proposed algorithm for speech separation in reverberant environments.
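As a rough illustration of what "encoding inter-microphone phase patterns as additional input features" could look like, the sketch below stacks the log-magnitude spectrum of a reference channel with the cosine and sine of the inter-channel phase difference (IPD) between two microphones. The abstract does not specify the exact feature definition, so the cos/sin IPD encoding, the function names `stft` and `spectral_spatial_features`, and the frame and hop sizes are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def stft(x, n_fft=256, hop=64):
    """Naive STFT: Hann-windowed frames followed by a one-sided FFT.

    Returns a complex array of shape (frames, n_fft // 2 + 1).
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop: i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=-1)


def spectral_spatial_features(ch0, ch1, n_fft=256, hop=64, eps=1e-8):
    """Concatenate spectral and spatial features per time frame.

    Hypothetical feature layout (an assumption, not taken from the paper):
    [log-magnitude of channel 0 | cos(IPD) | sin(IPD)].
    """
    X0 = stft(ch0, n_fft, hop)
    X1 = stft(ch1, n_fft, hop)
    log_mag = np.log(np.abs(X0) + eps)   # spectral part
    ipd = np.angle(X0 * np.conj(X1))     # phase difference between channels
    # cos/sin avoid the discontinuity caused by phase wrapping at +/- pi.
    return np.concatenate([log_mag, np.cos(ipd), np.sin(ipd)], axis=-1)


# Toy two-channel signal: the second "microphone" is a delayed copy of the first.
rng = np.random.default_rng(0)
ch0 = rng.standard_normal(4000)
ch1 = np.roll(ch0, 3)
feats = spectral_spatial_features(ch0, ch1)
print(feats.shape)
```

In a deep clustering setup, a matrix like `feats` would be the network input from which per-time-frequency-bin embeddings are predicted; the network and clustering step are omitted here.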

Zhong-Qiu Wang | Jonathan Le Roux | John R. Hershey
