Monaural Audio Source Separation Using Variational Autoencoders

We introduce a monaural audio source separation framework based on a latent generative model. Discriminative approaches to source separation have traditionally relied on deep neural networks or non-negative matrix factorization. In this paper, we propose a principled generative approach using variational autoencoders (VAEs). The VAE performs efficient approximate Bayesian inference, yielding a continuous latent representation of the input data (the spectrogram). It consists of a probabilistic encoder that maps input data to a latent space and a probabilistic decoder that maps latent codes back to the input space. This allows us to learn a latent representation of a source that is robust to corruption by noise and interfering sources; feeding this representation to the decoder yields the separated source. Both the encoder and the decoder are implemented as multilayer perceptrons (MLPs). In contrast to prevalent techniques, we argue that the VAE offers a more principled approach to source separation. Experimentally, we find that the proposed framework yields reasonable improvements over baseline methods from the literature, namely DNNs and RNNs with different masking functions, and autoencoders. Our method outperforms the best of these methods with an approximately 2 dB improvement in the source-to-distortion ratio.

Index Terms: Autoencoder, Variational inference, Latent variable, Source separation, Generative models, Deep learning
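The encoder/decoder structure described in the abstract can be sketched as follows. This is a minimal, numpy-only illustration of the forward pass of an MLP-based VAE applied to magnitude-spectrogram frames; all layer sizes, weight initializations, and function names here are assumptions for illustration, not the paper's actual architecture, and training (maximizing the ELBO) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

class VAESketch:
    """Illustrative MLP encoder/decoder pair; dimensions are assumptions."""

    def __init__(self, n_freq=513, n_hidden=128, n_latent=20):
        s = 0.01  # small random initialization
        self.W_enc = rng.normal(0.0, s, (n_freq, n_hidden))
        self.W_mu = rng.normal(0.0, s, (n_hidden, n_latent))
        self.W_logvar = rng.normal(0.0, s, (n_hidden, n_latent))
        self.W_dec1 = rng.normal(0.0, s, (n_latent, n_hidden))
        self.W_dec2 = rng.normal(0.0, s, (n_hidden, n_freq))

    def encode(self, x):
        # Probabilistic encoder: map a spectrogram frame to the
        # parameters (mean, log-variance) of a Gaussian in latent space.
        h = relu(x @ self.W_enc)
        return h @ self.W_mu, h @ self.W_logvar

    def reparameterize(self, mu, logvar):
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
        eps = rng.standard_normal(mu.shape)
        return mu + np.exp(0.5 * logvar) * eps

    def decode(self, z):
        # Probabilistic decoder: map a latent code back to the input
        # space; final ReLU keeps the output spectrogram non-negative.
        return relu(relu(z @ self.W_dec1) @ self.W_dec2)

    def separate(self, mixture_frames):
        # At test time, the latent code inferred from the (corrupted)
        # mixture is decoded to an estimate of the clean source.
        mu, logvar = self.encode(mixture_frames)
        z = self.reparameterize(mu, logvar)
        return self.decode(z)

# Usage: separate 4 hypothetical mixture frames with 513 frequency bins.
vae = VAESketch()
mixture = rng.random((4, 513))
estimate = vae.separate(mixture)
```

At training time one would maximize the evidence lower bound (reconstruction log-likelihood minus the KL divergence between the encoder's Gaussian and the prior), which is what makes the latent representation robust to corruption.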
