Frame-by-Frame Closed-Form Update for Mask-Based Adaptive MVDR Beamforming

Beamforming approaches using time-frequency masks have recently been investigated and have shown promising results for noise-robust automatic speech recognition (ASR) on many tasks. Time-frequency masks are estimated to compute the spatial statistics of the target speech and noise signals, and these statistics are then used to derive a beamformer. Although the effectiveness of this approach has been clearly demonstrated in batch and block-wise processing, it has not been well extended to frame-by-frame processing, which is essential for many practical applications. In this paper, we derive a frame-by-frame update rule for a mask-based minimum variance distortionless response (MVDR) beamformer, which, combined with unidirectional recurrent neural network-based mask estimation, enables us to obtain enhanced signals without a long delay. Based on the Woodbury matrix identity, our algorithm achieves a closed-form solution of the mask-based MVDR beamformer at every time frame without any explicit matrix inversion. Experimental results show that our frame-by-frame beamformer outperforms a baseline block-wise beamformer on the CHiME-3 simulation dataset, even with a shorter time delay.
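The update described above can be sketched per frequency bin: the mask-weighted speech covariance is accumulated recursively, while the inverse of the noise covariance is propagated directly via the Sherman-Morrison (rank-1 Woodbury) form, so no matrix inversion is performed at any frame. The sketch below is an illustration under assumptions, not the paper's exact algorithm: it uses a Souden-style MVDR solution with a fixed reference channel, an exponential forgetting factor `alpha`, and scalar per-bin masks `m_s` and `m_n`; the function name and parameters are hypothetical.

```python
import numpy as np

def mvdr_frame_update(y, m_s, m_n, phi_s, phi_n_inv, alpha=0.99, ref=0):
    """One frame-by-frame MVDR update for a single frequency bin.

    y         : (M,) complex observation at the current frame
    m_s, m_n  : scalar speech / noise masks for this time-frequency point
    phi_s     : (M, M) running mask-weighted speech covariance
    phi_n_inv : (M, M) running INVERSE of the noise covariance
    """
    # Recursive mask-weighted speech covariance: Phi_S <- a*Phi_S + m_s y y^H
    phi_s = alpha * phi_s + m_s * np.outer(y, y.conj())

    # Sherman-Morrison update of the inverse noise covariance for
    # Phi_N <- a*Phi_N + m_n y y^H, with no explicit matrix inversion.
    a_inv = phi_n_inv / alpha
    v = a_inv @ y                                  # (a*Phi_N)^{-1} y
    denom = 1.0 + m_n * np.vdot(y, v).real         # 1 + m_n y^H (a*Phi_N)^{-1} y
    phi_n_inv = a_inv - m_n * np.outer(v, v.conj()) / denom

    # Closed-form MVDR filter: w = Phi_N^{-1} Phi_S e_ref / tr(Phi_N^{-1} Phi_S)
    num = phi_n_inv @ phi_s
    w = num[:, ref] / np.trace(num)
    return np.vdot(w, y), phi_s, phi_n_inv        # enhanced output w^H y
```

Because the inverse is updated in place, the per-frame cost is O(M^2) in the number of microphones M, which is what makes a closed-form solution at every frame practical.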
