Online Multichannel Speech Enhancement Based on Recursive EM and DNN-Based Speech Presence Estimation

This article presents a recursive expectation-maximization algorithm for online multichannel speech enhancement. A deep neural network mask estimator is used to compute the speech presence probability, which is then refined by means of statistical spatial models of the noisy speech and noise signals. The clean speech signal is estimated using beamforming, single-channel linear postfiltering, and speech presence masking. The clean speech statistics and speech presence probabilities are then used to compute the acoustic parameters for beamforming and postfiltering by means of maximum likelihood estimation. This iterative procedure is carried out on a frame-by-frame basis. The algorithm integrates the different estimates in a common statistical framework suitable for online scenarios and can exploit spectral, spatial, and temporal speech properties. The proposed algorithm is tested in different noisy environments using the multichannel recordings of the CHiME-4 database. The experimental results show that the method outperforms related state-of-the-art approaches in noise reduction while allowing low-latency processing for real-time applications.
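
The frame-by-frame idea described in the abstract can be illustrated with a minimal sketch for a single frequency bin. This is not the algorithm of the article. It assumes a generic mask-weighted recursive covariance update followed by a textbook MVDR beamformer and a simple speech presence mask, and it omits the recursive EM parameter estimation and the linear postfilter. All names (`update_covariance`, `mvdr_weights`, `enhance_frame`), the forgetting factor, and the fixed steering vector and speech presence value are hypothetical choices made for illustration.

```python
import numpy as np


def update_covariance(cov, y, weight, forgetting=0.95):
    """Exponentially weighted recursive update of a spatial covariance matrix.

    cov        : (M, M) previous covariance estimate
    y          : (M,) noisy multichannel observation for the current frame
    weight     : scalar in [0, 1], e.g. the noise presence probability
    forgetting : assumed forgetting factor, not a value from the article
    """
    return forgetting * cov + (1.0 - forgetting) * weight * np.outer(y, y.conj())


def mvdr_weights(noise_cov, steering):
    """Textbook MVDR weights w = R^{-1} d / (d^H R^{-1} d)."""
    num = np.linalg.solve(noise_cov, steering)
    return num / (steering.conj() @ num)


def enhance_frame(y, spp, noise_cov, steering, forgetting=0.95):
    """Process one frame: update noise statistics, beamform, apply the mask.

    spp is the speech presence probability for this frame. In the article it
    comes from a DNN refined by spatial models. Here it is simply an input.
    """
    noise_cov = update_covariance(noise_cov, y, 1.0 - spp, forgetting)
    w = mvdr_weights(noise_cov, steering)
    beamformed = w.conj() @ y
    return spp * beamformed, noise_cov


# Usage: run a few frames of synthetic noise through the sketch.
rng = np.random.default_rng(0)
num_mics = 4
steering = np.exp(-1j * np.pi * 0.3 * np.arange(num_mics))  # hypothetical steering vector
noise_cov = np.eye(num_mics, dtype=complex)  # identity start keeps the matrix invertible
for _ in range(20):
    frame = rng.standard_normal(num_mics) + 1j * rng.standard_normal(num_mics)
    estimate, noise_cov = enhance_frame(frame, 0.2, noise_cov, steering)
```

Starting from the identity matrix keeps the noise covariance positive definite under this update, so the linear solve inside `mvdr_weights` is always well posed in the sketch. A practical system would run this per frequency bin and estimate the steering vector and speech presence probability online rather than fixing them.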
