Variational Recurrent Neural Networks for Speech Separation

We present a new stochastic learning machine for speech separation based on the variational recurrent neural network (VRNN). The VRNN is constructed from the perspectives of the generative stochastic network and the variational auto-encoder. The idea is to faithfully characterize the randomness of the hidden states of a recurrent neural network through variational learning. The neural parameters of this latent variable model are estimated by maximizing the variational lower bound on the log marginal likelihood. An inference network, driven by the variational distribution, is trained from a set of mixed signals and the associated source targets, yielding a novel supervised VRNN for speech separation. The proposed VRNN offers a stochastic point of view that accommodates the uncertainty in hidden states and facilitates the analysis of model construction. A masking function is further applied to the network outputs for speech separation. The benefit of the VRNN is demonstrated by experiments on monaural speech separation.
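The core computations behind this training objective can be illustrated in isolation. The sketch below, a minimal NumPy illustration rather than the authors' implementation, shows three building blocks the abstract relies on: the reparameterized sampling of a latent state, the closed-form Gaussian KL term of the variational lower bound, and a ratio-style masking function applied to two network outputs. The prior and posterior statistics here are toy placeholders; in the actual VRNN they would be produced by the recurrent prior network and the inference network, which are not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    # KL( N(mu_q, var_q) || N(mu_p, var_p) ), summed over latent dimensions.
    # This is the regularization term of the variational lower bound.
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )

def reparameterize(mu, logvar):
    # z = mu + sigma * eps with eps ~ N(0, I), so sampling the latent
    # state stays differentiable with respect to (mu, logvar).
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def soft_mask(out1, out2):
    # Ratio-style masking on the magnitudes of two network outputs;
    # the two masks sum to one in every time-frequency bin.
    m1 = np.abs(out1) / (np.abs(out1) + np.abs(out2) + 1e-8)
    return m1, 1.0 - m1

# Toy usage for one time step with a 4-dimensional latent variable.
mu_q, logvar_q = np.zeros(4), np.zeros(4)   # posterior stats (inference net)
mu_p, logvar_p = np.zeros(4), np.zeros(4)   # prior stats (recurrent state)
z = reparameterize(mu_q, logvar_q)
kl = gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)  # zero when q equals p
```

In the full model, the negative reconstruction error of the two sources minus this KL term is accumulated over time steps to form the lower bound that is maximized during training.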
