Performance of Mask Based Statistical Beamforming in a Smart Home Scenario

Mask-based statistical beamforming, where signal statistics for the target and the interference are estimated from time-frequency masks and used to compute the beamformer, has shown great effectiveness in the two recent CHiME challenges. This idea has sparked interest in the research community and resulted in numerous proposed approaches building on it. At the same time, the advent of voice-controlled smart home devices, such as Google Home and Amazon Alexa, has strengthened the need for robust far-field automatic speech recognition. In this paper, we evaluate whether mask-based beamforming can live up to the expectations created by the CHiME challenges and provide similar gains in a smart home scenario. To this end, we pinpoint the main differences between the scenarios, review the recent developments, and conduct extensive experiments on large-scale data. These experiments show that, while a 10 % relative reduction of the word error rate can be achieved, the gains are not as high as those seen in the CHiME challenges. We also show that approaches where the front-end and back-end are trained jointly do not reach the performance level of their independently trained counterparts. On the plus side, we see a 20 % relative improvement for an evaluation set with crosstalk.
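To make the underlying idea concrete, the following is a minimal sketch of how mask-based statistical beamforming is commonly realized: time-frequency masks weight the multi-channel STFT snapshots to estimate spatial covariance matrices for speech and interference, from which an MVDR filter is derived. The function name, array shapes, and the use of the principal eigenvector of the speech covariance as steering vector are illustrative assumptions, not the exact system evaluated in the paper.

```python
import numpy as np

def mask_based_mvdr(Y, speech_mask, noise_mask):
    """Illustrative mask-based MVDR beamforming.

    Y:           multi-channel STFT, shape (channels, frames, freqs), complex
    speech_mask: time-frequency mask for the target,   shape (frames, freqs)
    noise_mask:  time-frequency mask for interference, shape (frames, freqs)
    Returns the beamformed single-channel STFT, shape (frames, freqs).
    """
    C, T, F = Y.shape
    X_hat = np.zeros((T, F), dtype=complex)

    for f in range(F):
        Yf = Y[:, :, f]  # (C, T) snapshots at frequency bin f

        # Mask-weighted spatial covariance matrices of target and interference
        Phi_xx = (speech_mask[:, f] * Yf) @ Yf.conj().T / max(speech_mask[:, f].sum(), 1e-10)
        Phi_nn = (noise_mask[:, f] * Yf) @ Yf.conj().T / max(noise_mask[:, f].sum(), 1e-10)
        Phi_nn += 1e-10 * np.eye(C)  # diagonal loading for numerical stability

        # Steering vector taken as the principal eigenvector of Phi_xx (an assumption)
        steering = np.linalg.eigh(Phi_xx)[1][:, -1]

        # MVDR filter: w = Phi_nn^{-1} d / (d^H Phi_nn^{-1} d)
        num = np.linalg.solve(Phi_nn, steering)
        w = num / (steering.conj() @ num)

        X_hat[:, f] = w.conj() @ Yf  # apply the filter to every frame
    return X_hat
```

In practice the masks would typically come from a neural mask estimator; when per-channel masks are produced, they are usually pooled across channels (for example by taking the median) before the covariance estimation step.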
