Discriminative deep recurrent neural networks for monaural speech separation

Deep neural networks are an emerging trend for solving a variety of problems in speech processing. In this paper, we propose a discriminative deep recurrent neural network (DRNN) model for monaural speech separation. Our idea is to construct the DRNN as a regression model that discovers the deep structure and regularity needed to reconstruct two source spectra from their mixture. To reinforce the discrimination between the two separated spectra, we estimate the DRNN separation parameters by minimizing an integrated objective function consisting of two terms. The first measures the within-source reconstruction errors of the individual source spectra, while the second conveys discrimination information that preserves the mutual difference between the two source spectra during supervised training. This discrimination term acts as a regularizer that maintains between-source separation in monaural source separation. Experiments demonstrate the effectiveness of the proposed method for speech separation compared with other methods.
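The integrated objective described above can be sketched as a sum of within-source reconstruction errors minus a weighted between-source term that penalizes each estimate for resembling the other source. This is a minimal illustration only; the function name and the discrimination weight `gamma` are assumptions, not taken from the paper:

```python
import numpy as np

def discriminative_loss(y1_hat, y2_hat, y1, y2, gamma=0.05):
    """Sketch of a discriminative objective for two-source separation.

    Combines the within-source reconstruction errors of each estimated
    spectrum against its own target with a between-source term that is
    subtracted, so that estimates resembling the *other* source are
    penalized. `gamma` is a hypothetical discrimination weight.
    """
    # Within-source reconstruction errors (each estimate vs. its target).
    within = np.sum((y1_hat - y1) ** 2) + np.sum((y2_hat - y2) ** 2)
    # Between-source term (each estimate vs. the other target).
    between = np.sum((y1_hat - y2) ** 2) + np.sum((y2_hat - y1) ** 2)
    return within - gamma * between
```

Subtracting the between-source term rewards estimates that stay far from the competing source spectrum, which is how the discrimination information acts as a regularizer during training.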