Bayesian Multichannel Speech Enhancement with a Deep Speech Prior

This paper describes statistical multichannel speech enhancement based on a deep generative model of speech spectra. Recently, deep neural networks (DNNs) have been widely used for converting noisy speech spectra into clean speech spectra or for estimating time-frequency masks. Such a supervised approach, however, requires a sufficient amount of training data (pairs of noisy and clean speech) and often fails in unseen noisy environments. An alternative is a blind source separation method called multichannel nonnegative matrix factorization (MNMF), which can jointly estimate low-rank source spectra and spatial covariances on the fly. However, the low-rankness assumption does not hold for speech spectra. To solve these problems, we propose a semi-supervised method based on an extension of MNMF that combines a deep generative model for speech spectra with a standard low-rank model for noise spectra. The speech model can be trained in advance with auto-encoding variational Bayes (AEVB) using only clean speech data and serves as a prior on clean speech spectra for enhancement. Given a noisy speech spectrogram, we estimate the posterior of the clean speech spectra while estimating the noise model on the fly. This adaptive estimation is achieved by Gibbs sampling in a unified Bayesian framework. Experimental results showed the potential of the proposed method.
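The generative structure described above (a deep speech prior combined with a low-rank NMF noise model) can be sketched as follows. This is a minimal, single-channel illustration under stated assumptions: the "decoder" is a toy fixed-weight function standing in for a VAE decoder that would actually be trained with AEVB on clean speech, all variable names (`decode`, `W_dec`, `lam_s`, etc.) are hypothetical, and the multichannel spatial model and Gibbs sampler of the paper are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
F, T, D, K = 64, 100, 8, 2  # freq bins, frames, latent dim, noise NMF rank

# Toy stand-in for a trained VAE decoder (assumption: in the actual method
# these weights would come from AEVB training on clean speech only).
W_dec = rng.standard_normal((F, D)) * 0.1

def decode(z):
    """Map latent variables z (D x T) to a nonnegative speech power spectrogram."""
    return np.exp(W_dec @ z)  # log-power parameterized by a linear layer

# Speech: deep generative prior. z_t ~ N(0, I), power spectrum from the decoder.
z = rng.standard_normal((D, T))
lam_s = decode(z)                          # speech power spectral density (F x T)

# Noise: standard low-rank NMF model, estimated on the fly in the paper.
W_n = rng.gamma(2.0, 0.5, size=(F, K))     # nonnegative basis spectra
H_n = rng.gamma(2.0, 0.5, size=(K, T))     # nonnegative activations
lam_n = W_n @ H_n                          # noise power spectral density (F x T)

# Observed mixture power under the additive Gaussian model.
lam_x = lam_s + lam_n

# The posterior mean of the speech spectrum is then a Wiener-like mask
# applied to the mixture observation.
mask = lam_s / lam_x
```

Once `z`, `W_n`, and `H_n` are sampled (or, in the paper, iteratively updated by Gibbs sampling given the noisy observation), `mask` applied to the observed spectrogram yields the enhanced speech estimate.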
