论文信息 - Semi-supervised Multichannel Speech Enhancement with Variational Autoencoders and Non-negative Matrix Factorization

Semi-supervised Multichannel Speech Enhancement with Variational Autoencoders and Non-negative Matrix Factorization

In this paper we address speaker-independent multichannel speech enhancement in unknown noisy environments. Our work is based on a well-established multichannel local Gaussian modeling framework. We propose to use a neural network for modeling the speech spectro-temporal content. The parameters of this supervised model are learned using the framework of variational autoencoders. The noisy recording environment is supposed to be unknown, so the noise spectro-temporal modeling remains unsupervised and is based on non-negative matrix factorization (NMF). We develop a Monte Carlo expectation-maximization algorithm and we experimentally show that the proposed approach outperforms its NMF-based counterpart, where speech is modeled using supervised NMF.

[1] Emmanuel Vincent,et al. A General Flexible Framework for the Handling of Prior Information in Audio Source Separation , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[2] Max Welling,et al. Auto-Encoding Variational Bayes , 2013, ICLR.

[3] Alexey Ozerov,et al. Multichannel Nonnegative Matrix Factorization in Convolutive Mixtures for Audio Source Separation , 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[4] Rémi Gribonval,et al. Under-Determined Reverberant Audio Source Separation Using a Full-Rank Spatial Covariance Model , 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[5] G. C. Wei,et al. A Monte Carlo Implementation of the EM Algorithm and the Poor Man's Data Augmentation Algorithms , 1990 .

[6] Carla Teixeira Lopes,et al. TIMIT Acoustic-Phonetic Continuous Speech Corpus , 2012 .

[7] Antoine Liutkus,et al. Gaussian Processes for Underdetermined Source Separation , 2011, IEEE Transactions on Signal Processing.

[8] Andries P. Hekstra,et al. Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[9] K. Chan,et al. Monte Carlo EM Estimation for Time Series Models Involving Counts , 1995 .

[10] Emmanuel Vincent,et al. Multichannel Audio Source Separation With Deep Neural Networks , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[11] Nobutaka Ito,et al. The Diverse Environments Multi-channel Acoustic Noise Database (DEMAND): A database of multichannel environmental noise recordings , 2013 .

[12] Rémi Gribonval,et al. Performance measurement in blind audio source separation , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[13] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.

[14] DeLiang Wang,et al. Supervised Speech Separation Based on Deep Learning: An Overview , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[15] Roland Badeau,et al. Multichannel Audio Source Separation With Probabilistic Reverberation Priors , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[16] Yoshua Bengio,et al. Understanding the difficulty of training deep feedforward neural networks , 2010, AISTATS.

[17] Sergey Ioffe,et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[18] Tatsuya Kawahara,et al. Statistical Speech Enhancement Based on Probabilistic Integration of Variational Autoencoder and Non-Negative Matrix Factorization , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[19] Philipos C. Loizou,et al. Speech Enhancement: Theory and Practice , 2007 .

[20] Pierre Vandergheynst,et al. Nonnegative matrix factorization and spatial covariance model for under-determined reverberant audio source separation , 2010, 10th International Conference on Information Science, Signal Processing and their Applications (ISSPA 2010).

[21] James L. Massey,et al. Proper complex random processes with applications to information theory , 1993, IEEE Trans. Inf. Theory.

[22] Hirokazu Kameoka,et al. Multichannel Extensions of Non-Negative Matrix Factorization With Complex-Valued Data , 2013, IEEE Transactions on Audio, Speech, and Language Processing.

[23] Hirokazu Kameoka,et al. Determined Blind Source Separation Unifying Independent Vector Analysis and Nonnegative Matrix Factorization , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[24] Christian P. Robert,et al. Monte Carlo Statistical Methods , 2005, Springer Texts in Statistics.

[25] Geoffrey E. Hinton,et al. Rectified Linear Units Improve Restricted Boltzmann Machines , 2010, ICML.

[26] Radu Horaud,et al. Speech Enhancement with Variational Autoencoders and Alpha-stable Distributions , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[27] Radu Horaud,et al. A VARIANCE MODELING FRAMEWORK BASED ON VARIATIONAL AUTOENCODERS FOR SPEECH ENHANCEMENT , 2018, 2018 IEEE 28th International Workshop on Machine Learning for Signal Processing (MLSP).

[28] F. Hansen,et al. Jensen's Operator Inequality , 2002, math/0204049.

[29] Tatsuya Kawahara,et al. Bayesian Multichannel Speech Enhancement with a Deep Speech Prior , 2018, 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC).

[30] Jonathan G. Fiscus,et al. Darpa Timit Acoustic-Phonetic Continuous Speech Corpus CD-ROM {TIMIT} | NIST , 1993 .

[31] Jesper Jensen,et al. An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[32] Nancy Bertin,et al. Nonnegative Matrix Factorization with the Itakura-Saito Divergence: With Application to Music Analysis , 2009, Neural Computation.