Frame-by-Frame Closed-Form Update for Mask-Based Adaptive MVDR Beamforming

Beamforming approaches using time-frequency masks have recently been investigated and have shown promising results for noise-robust automatic speech recognition (ASR) on many tasks. Time-frequency masks are estimated to compute the spatial statistics of the target speech and noise signals, and these statistics are then used to derive a beamformer. Although the effectiveness of this approach has been clearly demonstrated in batch and block-wise processing, it has not been well extended to frame-by-frame processing, which is essential for many practical applications. In this paper, we derive a frame-by-frame update rule for a mask-based minimum variance distortionless response (MVDR) beamformer, which, combined with unidirectional recurrent neural network-based mask estimation, enables us to obtain enhanced signals without a long delay. Based on the Woodbury matrix identity, our algorithm achieves a closed-form solution of the mask-based MVDR beamformer at every time frame without any explicit matrix inversion. Experimental results show that our frame-by-frame beamformer outperforms a baseline block-wise beamformer on the CHiME-3 simulation dataset, even with a shorter time delay.
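The update described above can be sketched per frequency bin: the mask-weighted speech covariance is accumulated recursively, while the inverse of the noise covariance is propagated directly via the Sherman-Morrison (rank-1 Woodbury) form, so no matrix inversion is performed at any frame. The sketch below is an illustration under assumptions, not the paper's exact algorithm: it uses a Souden-style MVDR solution with a fixed reference channel, an exponential forgetting factor `alpha`, and scalar per-bin masks `m_s` and `m_n`; the function name and parameters are hypothetical.

```python
import numpy as np

def mvdr_frame_update(y, m_s, m_n, phi_s, phi_n_inv, alpha=0.99, ref=0):
    """One frame-by-frame MVDR update for a single frequency bin.

    y         : (M,) complex observation at the current frame
    m_s, m_n  : scalar speech / noise masks for this time-frequency point
    phi_s     : (M, M) running mask-weighted speech covariance
    phi_n_inv : (M, M) running INVERSE of the noise covariance
    """
    # Recursive mask-weighted speech covariance: Phi_S <- a*Phi_S + m_s y y^H
    phi_s = alpha * phi_s + m_s * np.outer(y, y.conj())

    # Sherman-Morrison update of the inverse noise covariance for
    # Phi_N <- a*Phi_N + m_n y y^H, with no explicit matrix inversion.
    a_inv = phi_n_inv / alpha
    v = a_inv @ y                                  # (a*Phi_N)^{-1} y
    denom = 1.0 + m_n * np.vdot(y, v).real         # 1 + m_n y^H (a*Phi_N)^{-1} y
    phi_n_inv = a_inv - m_n * np.outer(v, v.conj()) / denom

    # Closed-form MVDR filter: w = Phi_N^{-1} Phi_S e_ref / tr(Phi_N^{-1} Phi_S)
    num = phi_n_inv @ phi_s
    w = num[:, ref] / np.trace(num)
    return np.vdot(w, y), phi_s, phi_n_inv        # enhanced output w^H y
```

Because the inverse is updated in place, the per-frame cost is O(M^2) in the number of microphones M, which is what makes a closed-form solution at every frame practical.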
