Multi-Channel Multi-Frame ADL-MVDR for Target Speech Separation

Many purely neural-network-based speech separation approaches have been proposed to improve objective assessment scores, but they often introduce nonlinear distortions that are harmful to modern automatic speech recognition (ASR) systems. Minimum variance distortionless response (MVDR) filters are often adopted to remove nonlinear distortions; however, conventional neural mask-based MVDR systems still leave relatively high levels of residual noise. Moreover, the matrix inverse involved in the MVDR solution is sometimes numerically unstable during joint training with neural networks. In this study, we propose a multi-channel multi-frame (MCMF) all deep learning (ADL)-MVDR approach for target speech separation, which extends our preliminary multi-channel ADL-MVDR approach. The proposed MCMF ADL-MVDR system addresses both linear and nonlinear distortions and fully utilizes spatio-temporal cross-correlations. The proposed systems are evaluated on a Mandarin audio-visual corpus and compared with several state-of-the-art approaches. Experimental results demonstrate the superiority of the proposed systems under different scenarios and across several objective evaluation metrics, including ASR performance.
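For context, the conventional MVDR solution that the abstract refers to can be sketched for a single frequency bin as w = Φ⁻¹v / (vᴴΦ⁻¹v), where Φ is the noise covariance matrix and v is the steering vector. The following is a minimal illustration of that textbook formula only, not the proposed ADL-MVDR system. The function name `mvdr_weights`, the `loading` parameter, the synthetic data, and the use of diagonal loading as a stabilizer are all assumptions made for the example.

```python
import numpy as np

def mvdr_weights(noise_cov, steering, loading=1e-6):
    """Textbook MVDR weights for one frequency bin (illustrative sketch).

    noise_cov: (M, M) Hermitian noise covariance matrix.
    steering:  (M,) steering vector toward the target.
    loading:   hypothetical diagonal-loading factor, relative to the
               mean diagonal power, added to keep the solve stable.
    """
    m = noise_cov.shape[0]
    # Scale the loading by the average power on the diagonal.
    scale = np.real(np.trace(noise_cov)) / m
    loaded = noise_cov + loading * scale * np.eye(m)
    # Solve a linear system rather than forming an explicit inverse.
    numerator = np.linalg.solve(loaded, steering)
    # np.vdot conjugates its first argument, giving v^H (Phi^-1 v).
    denominator = np.vdot(steering, numerator)
    return numerator / denominator

# Synthetic example: 4 microphones, 200 random noise snapshots.
rng = np.random.default_rng(0)
M = 4
snapshots = rng.standard_normal((M, 200)) + 1j * rng.standard_normal((M, 200))
noise_cov = snapshots @ snapshots.conj().T / 200
steering = np.exp(-1j * np.pi * np.arange(M) * np.sin(0.3))

w = mvdr_weights(noise_cov, steering)
# Distortionless constraint: w^H v should be close to 1.
response = np.vdot(w, steering)
```

The explicit matrix solve in this sketch is the step the abstract describes as sometimes numerically unstable during joint training with neural networks; diagonal loading is one common mitigation for it and is shown here purely for illustration.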
