Deep neural networks for single channel source separation

In this paper, a novel approach for single channel source separation (SCSS) using a deep neural network (DNN) architecture is introduced. Unlike previous studies, in which DNNs and other classifiers were used to classify time-frequency bins and obtain hard masks for each source, we use the DNN to classify the estimated source spectra and check their validity during separation. In the training stage, the training data for the source signals are used to train a DNN. In the separation stage, the trained DNN is used to aid the estimation of each source in the mixed signal. The single channel source separation problem is formulated as an energy minimization problem in which each estimated source spectrum is encouraged to fit the trained DNN model and the mixed signal spectrum is encouraged to be written as a weighted sum of the estimated source spectra. The proposed approach works regardless of the energy scale differences between the source signals in the training and separation stages. Nonnegative matrix factorization (NMF) is used to initialize the DNN estimate for each source. The experimental results show that the DNN-based separation initialized with NMF improves the quality of the separated signals compared with using NMF alone.
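To make the separation-stage formulation concrete, the sketch below illustrates one way such an energy minimization could be implemented. It is not the authors' implementation: the names `dnn`, `separate_frame`, and `lambda_reg` are hypothetical, the exact form of the energy and of the DNN validity score is assumed, and the NMF-based initial estimates `s1_init`/`s2_init` are taken as given.

```python
# Minimal sketch of the separation-stage energy minimization, assuming:
#   * `dnn` is a pre-trained network mapping a magnitude-spectrum frame to a
#     per-source validity score in (0, 1) for each of the two sources,
#   * `s1_init`, `s2_init` are NMF-based initial estimates of the source spectra,
#   * the energy combines a mixture-reconstruction term with a term rewarding
#     spectra the DNN considers valid; `lambda_reg` balances the two.
import torch

def separate_frame(x_mix, s1_init, s2_init, dnn, lambda_reg=0.1, n_iters=200, lr=1e-2):
    """Estimate two source spectra for one mixture frame by minimizing the energy."""
    s1 = torch.nn.Parameter(s1_init.clone())
    s2 = torch.nn.Parameter(s2_init.clone())
    # Free gains let the sources differ in scale from the training data.
    g = torch.nn.Parameter(torch.ones(2))
    opt = torch.optim.Adam([s1, s2, g], lr=lr)

    for _ in range(n_iters):
        opt.zero_grad()
        # Mixture spectrum modeled as a weighted sum of the estimated source spectra.
        recon = g[0] * torch.relu(s1) + g[1] * torch.relu(s2)
        recon_err = torch.sum((x_mix - recon) ** 2)
        # Encourage each estimate to look like a valid spectrum of its source
        # according to the trained DNN.
        validity = dnn(torch.relu(s1))[0] + dnn(torch.relu(s2))[1]
        energy = recon_err - lambda_reg * torch.log(validity + 1e-8)
        energy.backward()
        opt.step()

    return torch.relu(s1).detach(), torch.relu(s2).detach()
```

In practice the routine would be applied frame by frame to the mixture spectrogram; the resulting estimates could also be turned into soft masks applied to the mixture spectrum, as is common in NMF-based separation.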
