Permutation invariant training of deep models for speaker-independent multi-talker speech separation

We propose a novel deep learning training criterion, named permutation invariant training (PIT), for speaker-independent multi-talker speech separation, commonly known as the cocktail-party problem. Unlike the multi-class regression technique and the deep clustering (DPCL) technique, our approach minimizes the separation error directly. This strategy effectively solves the long-standing label permutation problem that has prevented progress in deep-learning-based techniques for speech separation. We evaluated PIT on the WSJ0 and Danish mixed-speech separation tasks and found that it compares favorably to non-negative matrix factorization (NMF), computational auditory scene analysis (CASA), and DPCL, and that it generalizes well to unseen speakers and languages. Since PIT is simple to implement and can easily be integrated and combined with other advanced techniques, we believe improvements built on PIT can eventually solve the cocktail-party problem.
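
To illustrate the core idea, the following is a minimal sketch of a permutation invariant loss, not the authors' implementation: the separation error is evaluated under every possible assignment of estimated sources to reference sources, and the minimum is taken. The function name `pit_mse_loss` and the (S, T, F) array layout (sources, time frames, frequency bins) are assumptions for this example.

```python
import itertools
import numpy as np

def pit_mse_loss(estimates, references):
    """Permutation invariant MSE (illustrative sketch).

    Evaluates the mean squared error under every permutation of the
    estimate-to-reference assignment and returns the smallest error
    together with the permutation that achieves it.

    estimates, references: arrays of shape (S, T, F) --
    S sources, T time frames, F frequency bins (assumed layout).
    """
    n_sources = estimates.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(n_sources)):
        # Error when estimate i is paired with reference perm[i].
        loss = np.mean((estimates - references[list(perm)]) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm
```

Because only the minimizing permutation contributes to the training objective, the network is free to emit the separated sources in any order, which is what resolves the label permutation problem.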
