Deep clustering: Discriminative embeddings for segmentation and separation

We address the problem of "cocktail-party" source separation in a deep learning framework we call deep clustering. Previous deep network approaches to separation have shown promising performance in scenarios with a fixed number of sources, each belonging to a distinct signal class, such as speech and noise. However, when the number of sources and their classes are arbitrary, such "class-based" methods are not suitable. Instead, we train a deep network to assign contrastive embedding vectors to each time-frequency region of the spectrogram, implicitly predicting the segmentation labels of the target spectrogram from the input mixture. This yields a deep-network-based analogue to spectral clustering: the embeddings form a low-rank pairwise affinity matrix that approximates the ideal affinity matrix, while enabling much faster computation. At test time, a clustering step "decodes" the segmentation implicit in the embeddings by optimizing the K-means objective with respect to the unknown assignments. Preliminary experiments on single-channel mixtures of multiple speakers show that a speaker-independent model trained on two-speaker mixtures improves signal quality for mixtures of held-out speakers by an average of 6 dB. More strikingly, the same model performs surprisingly well on three-speaker mixtures.
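
For concreteness, below is a minimal sketch of the affinity-matching objective and the K-means decoding step described above. The NumPy/scikit-learn setting, the function and variable names (`deep_clustering_loss`, `decode_masks`, `V`, `Y`), and the toy data are illustrative assumptions, not the authors' implementation; the loss is the standard Frobenius-norm match between the embedding affinity matrix and the ideal (segmentation-label) affinity matrix.

```python
import numpy as np
from sklearn.cluster import KMeans

def deep_clustering_loss(V, Y):
    """Affinity-matching loss ||V V^T - Y Y^T||_F^2.

    V: (N, D) embeddings, one per time-frequency bin (assumed unit-norm rows).
    Y: (N, C) one-hot indicators of the ideal source assignment per bin.
    Expanding the squared norm avoids forming the N x N affinity matrices:
    ||VV^T - YY^T||_F^2 = ||V^T V||_F^2 - 2||V^T Y||_F^2 + ||Y^T Y||_F^2.
    """
    return (np.linalg.norm(V.T @ V, 'fro') ** 2
            - 2 * np.linalg.norm(V.T @ Y, 'fro') ** 2
            + np.linalg.norm(Y.T @ Y, 'fro') ** 2)

def decode_masks(V, n_sources):
    """Test-time decoding: cluster embeddings with K-means into binary masks."""
    assignments = KMeans(n_clusters=n_sources, n_init=10).fit_predict(V)
    # One binary time-frequency mask per source.
    return np.stack([(assignments == k).astype(float)
                     for k in range(n_sources)])

# Toy example: 6 time-frequency bins, 2 sources, 4-dimensional embeddings.
rng = np.random.default_rng(0)
Y = np.array([[1, 0], [1, 0], [1, 0],
              [0, 1], [0, 1], [0, 1]], dtype=float)
V = Y @ rng.standard_normal((2, 4))            # embeddings aligned with sources
V /= np.linalg.norm(V, axis=1, keepdims=True)  # unit-normalize the rows
print(deep_clustering_loss(V, Y))              # small for well-aligned embeddings
print(decode_masks(V, n_sources=2))            # recovers the two-source partition
```

Expanding the norm as in `deep_clustering_loss` keeps the cost linear in the number of time-frequency bins N, since the N x N affinity matrices are never materialized; this is one sense in which the low-rank formulation is much faster than explicit spectral clustering.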
