Leveraging Low-Distortion Target Estimates for Improved Speech Enhancement

A promising approach to multi-microphone speech separation involves two deep neural networks (DNNs): the target speech predicted by the first DNN is used to compute signal statistics for time-invariant minimum variance distortionless response (MVDR) beamforming, and the MVDR result is then fed to the second DNN as extra features for predicting the target speech. Previous studies suggested that the MVDR result provides complementary information that helps the second DNN better predict the target speech. However, on fixed-geometry arrays, both DNNs can take in, for example, the real and imaginary (RI) components of the multi-channel mixture as features, and thereby exploit both spatial and spectral information for enhancement. It has not been clearly explained why the linear MVDR result is complementary and still needed, given that the DNNs and the beamformer operate on the same input, and the non-linear filtering performed by the DNNs could, in principle, render the linear MVDR filtering unnecessary. Similarly, in monaural cases the MVDR beamformer can be replaced with a monaural weighted prediction error (WPE) filter; although the linear WPE filter and the DNNs use the same mixture RI components as input, the WPE result is found to significantly improve the second DNN. This study provides a novel explanation from the perspective of the low-distortion nature of such algorithms, and finds that they consistently improve phase estimation. Equipped with this understanding, we investigate a range of low-distortion target estimation algorithms, including multiple beamformers, WPE, forward convolutive prediction (FCP), and their combinations, and use their results as extra features to train the second network for better enhancement. Evaluation results on single- and multi-microphone speech dereverberation and enhancement tasks demonstrate the effectiveness of the proposed approach and the validity of the proposed view.
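To make the two-stage pipeline concrete, below is a minimal NumPy sketch of the beamforming step: the first-pass DNN estimate of the target at a reference microphone is turned into a time-frequency weight, per-frequency speech and noise spatial covariance matrices are accumulated over all frames, and a time-invariant MVDR filter (Souden formulation) is applied to the mixture. The function name `mvdr_from_dnn_estimate`, the magnitude-ratio weighting, and the diagonal loading are illustrative assumptions of this sketch, not the paper's exact statistic computation.

```python
import numpy as np

def mvdr_from_dnn_estimate(Y, S_hat, ref_mic=0, eps=1e-8):
    """Time-invariant MVDR beamforming driven by a first-pass DNN estimate.

    Y:     (M, F, T) complex STFT of the multi-channel mixture.
    S_hat: (F, T)    complex STFT of the target predicted by the first DNN
                     at the reference microphone.
    Returns the (F, T) beamformed STFT, to be stacked with the mixture
    as extra input features for the second DNN.
    """
    M, F, T = Y.shape

    # Heuristic TF weight from the DNN estimate (an assumption of this
    # sketch): the magnitude ratio between estimate and mixture, capped at 1.
    lam = np.minimum(np.abs(S_hat) / (np.abs(Y[ref_mic]) + eps), 1.0)  # (F, T)

    # Weighted spatial covariance matrices of speech and noise per frequency.
    Yf = Y.transpose(1, 0, 2)                                  # (F, M, T)
    Phi_s = np.einsum('ft,fmt,fnt->fmn', lam, Yf, Yf.conj()) / T
    Phi_n = np.einsum('ft,fmt,fnt->fmn', 1.0 - lam, Yf, Yf.conj()) / T
    Phi_n += eps * np.eye(M)                                   # diagonal loading

    # Souden MVDR solution: w(f) = (Phi_n^{-1} Phi_s / trace(Phi_n^{-1} Phi_s)) u,
    # where u selects the reference microphone.
    num = np.linalg.solve(Phi_n, Phi_s)                        # (F, M, M)
    tr = np.trace(num, axis1=1, axis2=2)[:, None, None] + eps
    u = np.zeros(M)
    u[ref_mic] = 1.0
    w = (num / tr) @ u                                         # (F, M)

    # Apply the time-invariant filter: s_hat(f, t) = w(f)^H y(f, t).
    return np.einsum('fm,fmt->ft', w.conj(), Yf)
```

The beamformed output, concatenated with the mixture RI components, would then serve as the extra features for the second network; swapping this step for a WPE or FCP filter yields the other low-distortion target estimates investigated in the study.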
