Underdetermined Source Separation Based on Generalized Multichannel Variational Autoencoder

This paper deals with a multichannel audio source separation problem under underdetermined conditions. Multichannel non-negative matrix factorization (MNMF) is a powerful method for underdetermined audio source separation, which adopts the NMF concept to model and estimate the power spectrograms of the sound sources in a mixture signal. This concept is also used in independent low-rank matrix analysis (ILRMA), a special class of the MNMF formulated under determined conditions. While these methods work reasonably well for particular types of sound sources, one limitation is that they can fail to work for sources with spectrograms that do not comply with the NMF model. To address this limitation, an extension of ILRMA called the multichannel variational autoencoder (MVAE) method was recently proposed, where a conditional VAE (CVAE) is used instead of the NMF model for expressing source power spectrograms. This approach has performed impressively in determined source separation tasks thanks to the representation power of deep neural networks. While the original MVAE method was formulated under determined mixing conditions, this paper proposes a generalized version of it by combining the ideas of MNMF and MVAE so that it can also deal with underdetermined cases. We call this method the generalized MVAE (GMVAE) method. In underdetermined source separation and speech enhancement experiments, the proposed method performed better than baseline methods.

[1]  Kazuyoshi Yoshii,et al.  Correlated Tensor Factorization for Audio Source Separation , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2]  Katsutoshi Itoyama,et al.  Student's t multichannel nonnegative matrix factorization for blind source separation , 2016, 2016 IEEE International Workshop on Acoustic Signal Enhancement (IWAENC).

[3]  Nobutaka Ono,et al.  Stable and fast update rules for independent vector analysis based on auxiliary function technique , 2011, 2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA).

[4]  Alexey Ozerov,et al.  Multichannel Nonnegative Matrix Factorization in Convolutive Mixtures for Audio Source Separation , 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[5]  J. Cardoso,et al.  Maximum likelihood approach for blind audio source separation using time-frequency Gaussian source models , 2005, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2005..

[6]  Hirokazu Kameoka,et al.  Determined Blind Source Separation Unifying Independent Vector Analysis and Nonnegative Matrix Factorization , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[7]  Emmanuel Vincent,et al.  Multichannel Audio Source Separation With Deep Neural Networks , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[8]  H. Kameoka,et al.  Determined Blind Source Separation with Independent Low-Rank Matrix Analysis , 2018 .

[9]  Junichi Yamagishi,et al.  The Voice Conversion Challenge 2018: Promoting Development of Parallel and Nonparallel Methods , 2018, Odyssey.

[10]  Hirokazu Kameoka,et al.  Statistical Model of Speech Signals Based on Composite Autoregressive System with Application to Blind Source Separation , 2010, LVA/ICA.

[11]  Te-Won Lee,et al.  Independent Vector Analysis: An Extension of ICA to Multivariate Components , 2006, ICA.

[12]  Li Li,et al.  Generalized Multichannel Variational Autoencoder for Underdetermined Source Separation , 2019, 2019 27th European Signal Processing Conference (EUSIPCO).

[13]  Tatsuya Kawahara,et al.  Bayesian Multichannel Speech Enhancement with a Deep Speech Prior , 2018, 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC).

[14]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[15]  P. Smaragdis,et al.  Non-negative matrix factorization for polyphonic music transcription , 2003, 2003 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (IEEE Cat. No.03TH8684).

[16]  Hirokazu Kameoka,et al.  General Formulation of Multichannel Extensions of NMF Variants , 2018 .

[17]  Hirokazu Kameoka,et al.  Supervised Determined Source Separation with Multichannel Variational Autoencoder , 2019, Neural Computation.

[18]  Nancy Bertin,et al.  Nonnegative Matrix Factorization with the Itakura-Saito Divergence: With Application to Music Analysis , 2009, Neural Computation.

[19]  Rémi Gribonval,et al.  Underdetermined Instantaneous Audio Source Separation via Local Gaussian Modeling , 2009, ICA.

[20]  Radu Horaud,et al.  Semi-supervised Multichannel Speech Enhancement with Variational Autoencoders and Non-negative Matrix Factorization , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[21]  Mark D. Plumbley,et al.  Probabilistic Modeling Paradigms for Audio Source Separation , 2010 .

[22]  Atsuo Hiroe,et al.  Solution of Permutation Problem in Frequency Domain ICA, Using Multivariate Probability Density Functions , 2006, ICA.

[23]  Hiroshi Saruwatari,et al.  Independent low-rank matrix analysis based on complex student's t-distribution for blind audio source separation , 2017, 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP).

[24]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[25]  Li Li,et al.  Semi-blind source separation with multichannel variational autoencoder , 2018, ArXiv.

[26]  Max Welling,et al.  Semi-supervised Learning with Deep Generative Models , 2014, NIPS.

[27]  D. Hunter,et al.  A Tutorial on MM Algorithms , 2004 .

[28]  Hirokazu Kameoka,et al.  Multichannel Extensions of Non-Negative Matrix Factorization With Complex-Valued Data , 2013, IEEE Transactions on Audio, Speech, and Language Processing.