Dual-channel eKF-RTF framework for speech enhancement with DNN-based speech presence estimation

This paper presents a dual-channel speech enhancement framework that integrates deep neural network (DNN) mask estimators. The framework follows a beamforming-plus-postfiltering approach intended for noise reduction on dual-microphone smartphones. An extended Kalman filter estimates the relative acoustic channel between the microphones, while noise estimation is performed using a speech presence probability estimator. We propose the use of a DNN estimator to improve the prediction of the speech presence probabilities without making any assumption about the statistics of the signals. To improve the accuracy of this estimator, we evaluate and compare different dual-channel features, including the power and phase differences between the speech signals at the two microphones. The proposed integrated scheme is evaluated in different reverberant and noisy environments with the smartphone used in both close- and far-talk positions. The experimental results show that our approach achieves significant improvements in terms of speech quality, intelligibility, and distortion when compared to other approaches based only on statistical signal processing.
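To make the dual-channel features mentioned above more concrete, the following is a minimal sketch of how a per-bin power level difference and inter-channel phase difference could be computed from two microphone signals. It is an illustration under stated assumptions, not the feature extraction used in the paper: the function names (`stft`, `dual_channel_features`), the frame length, the hop size, the Hann window, and the dB scaling are all hypothetical choices.

```python
import numpy as np


def stft(x, frame_len=512, hop=256):
    """Short-time Fourier transform with a Hann window (assumed settings)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # One row per frame, one column per non-negative frequency bin.
    return np.fft.rfft(frames, axis=1)


def dual_channel_features(x1, x2, frame_len=512, hop=256, eps=1e-10):
    """Return (power level difference in dB, phase difference in radians).

    Both outputs have shape (n_frames, frame_len // 2 + 1).
    """
    X1 = stft(x1, frame_len, hop)
    X2 = stft(x2, frame_len, hop)
    p1 = np.abs(X1) ** 2
    p2 = np.abs(X2) ** 2
    # Log power ratio between the two microphones; eps avoids log(0).
    pld = 10.0 * np.log10((p1 + eps) / (p2 + eps))
    # Phase of the cross-spectrum, wrapped to (-pi, pi].
    ipd = np.angle(X1 * np.conj(X2))
    return pld, ipd


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x1 = rng.standard_normal(16000)
    x2 = 0.5 * x1  # second microphone receives an attenuated copy
    pld, ipd = dual_channel_features(x1, x2)
    print(pld.shape, float(pld.mean()), float(np.abs(ipd).max()))
```

In this toy case the second channel is a half-amplitude copy of the first, so the power level difference is about 6 dB in every bin and the phase difference is zero. Features of this kind could then be stacked per frame and fed to a mask or speech presence estimator.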
