Teacher-Student BLSTM Mask Model for Robust Acoustic Beamforming

Microphone array beamforming has proven to be an effective approach for suppressing adverse interference. Recently, acoustic beamformers that employ neural networks (NN) for time-frequency (T-F) mask prediction, termed Mask-BF, have received tremendous interest and shown great benefits as a front-end for distant automatic speech recognition (ASR). However, our preliminary ASR experiments with Mask-BF show that a mask model trained only on simulated data underperforms when real-recorded data appears at the testing stage, i.e., a data mismatch problem occurs. In this study, we aim to reduce the impact of this data mismatch on the mask model. Our idea is intuitive: real-recorded data can be used together with simulated data to make the mask model more robust to the mismatch. Specifically, two bi-directional long short-term memory (BLSTM) models are designed, serving as a teacher mask model and a student mask model, respectively. The teacher mask model is trained on simulated data and is then employed to generate soft mask labels for the simulated and real-recorded data separately. The simulated and real-recorded data, paired with the generated soft mask labels, then form the new training set for the student mask model. A novel teacher-student beamformer, T-S Mask-BF, is developed accordingly. Our T-S Mask-BF is evaluated as a front-end for ASR on the CHiME-3 dataset. Experimental results show that its generalization ability is enhanced: we obtain a relative 4% word error rate (WER) reduction over the baseline Mask-BF on the real-recording test set.
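
For concreteness, the teacher-student procedure described above can be sketched in a short PyTorch training loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the MaskBLSTM class, the feature dimensions, and the sim_loader/real_loader objects are all hypothetical placeholders.

```python
# Minimal sketch of the T-S mask training described in the abstract.
# Assumptions (not from the paper): MaskBLSTM architecture details,
# feature dimension 257, and the sim_loader/real_loader data loaders.
import itertools
import torch
import torch.nn as nn

class MaskBLSTM(nn.Module):
    """BLSTM that maps log-magnitude spectra to a T-F mask in [0, 1]."""
    def __init__(self, n_freq=257, hidden=512):
        super().__init__()
        self.blstm = nn.LSTM(n_freq, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_freq)

    def forward(self, x):                 # x: (batch, frames, n_freq)
        h, _ = self.blstm(x)
        return torch.sigmoid(self.proj(h))  # soft mask per T-F bin

teacher = MaskBLSTM()
student = MaskBLSTM()

# Step 1 (omitted): train the teacher on simulated data with ideal
# mask targets. The teacher is then frozen.
teacher.eval()
for p in teacher.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(student.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

def train_student(sim_loader, real_loader, epochs=10):
    """Step 2: train the student on simulated + real-recorded features,
    using the frozen teacher's soft mask predictions as labels."""
    student.train()
    for _ in range(epochs):
        for feats in itertools.chain(sim_loader, real_loader):
            with torch.no_grad():
                soft_label = teacher(feats)  # teacher's soft mask label
            loss = loss_fn(student(feats), soft_label)
            opt.zero_grad()
            loss.backward()
            opt.step()
```

Because the teacher supplies the targets, no ideal masks (and hence no clean reference signals) are required for the real-recorded utterances; this is what allows the real-recording data to enter the training set alongside the simulated data.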
