Utterance-level Permutation Invariant Training with Discriminative Learning for Single Channel Speech Separation

The challenge in deep learning for speaker-independent speech separation comes from the label ambiguity, or permutation, problem. The utterance-level permutation invariant training (uPIT) technique, a state-of-the-art deep learning approach, solves this problem by minimizing the mean square error (MSE) over all permutations between outputs and targets. However, uPIT only minimizes the MSE of the chosen (lowest-error) permutation; it does not discriminate that permutation from the others. This may increase the possibility that the separated sources are remixed. In this paper, we propose uPIT with discriminative learning (uPITDL) to address this problem by adding a regularization term to the cost function. In other words, we minimize the difference between the model's outputs and their corresponding reference signals, while maximizing the dissimilarity between each prediction and the targets of the other sources. We evaluate the proposed model on the WSJ0-2mix dataset. Experimental results show 22.0% and 24.8% relative improvements over the uPIT baseline under closed and open conditions, respectively.
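To make the idea concrete, the sketch below is a minimal, hypothetical rendering of a uPIT loss with a discriminative penalty, not the paper's exact objective: it takes the MSE of the best output-target permutation (plain uPIT) and subtracts a weighted sum of the MSEs of the competing permutations, so that competing assignments are pushed away from the chosen one. The function name `upit_dl_loss` and the weight `alpha` are illustrative assumptions.

```python
import itertools
import numpy as np

def upit_dl_loss(outputs, targets, alpha=0.1):
    """Hypothetical sketch of a uPIT loss with a discriminative term.

    outputs, targets: arrays of shape (num_speakers, T, F)
    alpha: assumed weight of the discriminative regularization.
    """
    num_spk = outputs.shape[0]
    # Utterance-level MSE for every possible output-target assignment.
    perm_losses = []
    for perm in itertools.permutations(range(num_spk)):
        mse = np.mean((outputs[list(perm)] - targets) ** 2)
        perm_losses.append(mse)
    perm_losses = np.array(perm_losses)
    best = perm_losses.min()
    # Plain uPIT would return `best` here; the discriminative term
    # additionally penalizes closeness to the competing permutations.
    others = perm_losses.sum() - best
    return best - alpha * others

# Toy usage: two speakers, 100 frames, 129 frequency bins.
rng = np.random.default_rng(0)
out = rng.standard_normal((2, 100, 129))
tgt = rng.standard_normal((2, 100, 129))
print(upit_dl_loss(out, tgt))
```

Setting alpha to 0 recovers standard uPIT; the sign convention and uniform weighting over the non-optimal permutations are illustrative choices under the stated assumptions.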
