ADL-MVDR: All Deep Learning MVDR Beamformer for Target Speech Separation

Speech separation algorithms are often used to separate target speech from interfering sources. However, purely neural-network-based speech separation systems often introduce nonlinear distortion that is harmful to ASR systems. The conventional mask-based minimum variance distortionless response (MVDR) beamformer can minimize this distortion, but leaves a high level of residual noise. Furthermore, the matrix inversion and eigenvalue decomposition involved in the conventional MVDR solution are numerically unstable when jointly trained with neural networks. In this paper, we propose a novel all-deep-learning MVDR framework in which the matrix inversion and eigenvalue decomposition are replaced by two recurrent neural networks (RNNs), resolving both issues at once. The proposed method greatly reduces residual noise while keeping the target speech undistorted by leveraging RNN-predicted frame-wise beamforming weights. The system is evaluated on a Mandarin audio-visual corpus and compared against several state-of-the-art (SOTA) speech separation systems. Experimental results demonstrate the superiority of the proposed method across several objective metrics and in ASR accuracy.
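To make the instability concrete, the sketch below shows the conventional (non-learned) MVDR weight computation for a single frequency bin, w = R_nn^{-1} d / (d^H R_nn^{-1} d), whose explicit matrix inversion is the step the proposed framework replaces with an RNN. This is a minimal NumPy illustration of the textbook solution, not the paper's implementation; the function name, diagonal-loading constant, and toy inputs are all assumptions for demonstration.

```python
import numpy as np

def mvdr_weights(noise_cov: np.ndarray, steering: np.ndarray) -> np.ndarray:
    """Conventional MVDR weights w = R_nn^{-1} d / (d^H R_nn^{-1} d)
    for one frequency bin: noise_cov is (M, M), steering is (M,)."""
    M = noise_cov.shape[0]
    # Diagonal loading keeps the inversion well-conditioned -- exactly the
    # kind of numerical fragility that motivates an all-DL replacement.
    loaded = noise_cov + 1e-6 * (np.trace(noise_cov).real / M) * np.eye(M)
    # Solve R_nn x = d instead of forming an explicit inverse.
    numer = np.linalg.solve(loaded, steering)
    return numer / (steering.conj() @ numer)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M = 4  # microphones
    # Random Hermitian positive-definite noise spatial covariance.
    A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
    R_nn = A @ A.conj().T + M * np.eye(M)
    d = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, M))  # toy steering vector
    w = mvdr_weights(R_nn, d)
    # Distortionless constraint: w^H d = 1, i.e. the target direction
    # passes through unchanged while noise power is minimized.
    print(abs(w.conj() @ d - 1.0) < 1e-8)
```

In the conventional pipeline this computation is repeated per frequency (and, for adaptive variants, per frame), so small conditioning errors in R_nn propagate through gradients during joint training, which is the failure mode the RNN-based replacement is designed to avoid.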
