A speech enhancement algorithm by iterating single- and multi-microphone processing and its application to robust ASR

We propose a speech enhancement algorithm that combines single- and multi-microphone processing. At its core, the algorithm estimates a time-frequency mask representing the target speech and uses masking-based beamforming to enhance the corrupted speech. In the single-microphone stage, each channel of the microphone array is treated as an individual signal, and a deep neural network (DNN) estimates a mask for each channel. In the multi-microphone stage, these masks are used to compute the spatial covariance matrix of the noise and the steering vector for beamforming. We further propose a masking-based post-filter to suppress residual noise in the beamformer output. The enhanced speech is then fed back to the DNN for mask re-estimation, and iterating these steps a few times yields the final enhanced speech. Evaluated as a frontend for automatic speech recognition (ASR), the proposed algorithm achieves a 5.05% average word error rate (WER) on the real-environment test set of CHiME-3, a 13.34% relative improvement over the previous best result.
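To make the pipeline concrete, below is a minimal NumPy sketch of one single-/multi-microphone pass, written under stated assumptions rather than from the authors' implementation: `estimate_mask`, `mvdr_weights`, and `enhance` are hypothetical names; an MVDR beamformer is used as one beamformer consistent with the noise-covariance-plus-steering-vector description; median pooling of the per-channel masks, eigenvector-based steering estimation, and mask multiplication as the post-filter are all illustrative choices, not the paper's prescribed ones.

```python
import numpy as np

def estimate_mask(magnitude):
    """Hypothetical stand-in for the per-channel DNN mask estimator.

    Takes a (frames, freqs) magnitude spectrogram and returns a
    time-frequency mask in [0, 1] marking target-speech dominance.
    """
    raise NotImplementedError("replace with a trained DNN")

def mvdr_weights(phi_n, d):
    """MVDR weights for one frequency bin, given the noise spatial
    covariance phi_n (channels x channels) and steering vector d."""
    num = np.linalg.solve(phi_n, d)    # phi_n^{-1} d
    return num / (d.conj() @ num)      # normalize so that w^H d = 1

def enhance(stft, masks, eps=1e-8):
    """One multi-microphone pass.

    stft:  (channels, frames, freqs) complex STFT of the array signals.
    masks: (channels, frames, freqs) per-channel masks from the DNN.
    """
    channels, frames, freqs = stft.shape
    # Pool the per-channel masks into one speech mask (median pooling is
    # one common choice; the abstract does not prescribe an exact rule).
    speech_mask = np.median(masks, axis=0)        # (frames, freqs)
    noise_mask = 1.0 - speech_mask
    out = np.zeros((frames, freqs), dtype=complex)
    for f in range(freqs):
        Y = stft[:, :, f]                         # (channels, frames)
        # Mask-weighted spatial covariance matrices of speech and noise.
        phi_s = (speech_mask[:, f] * Y) @ Y.conj().T / (speech_mask[:, f].sum() + eps)
        phi_n = (noise_mask[:, f] * Y) @ Y.conj().T / (noise_mask[:, f].sum() + eps)
        # Diagonal loading keeps phi_n safely invertible.
        phi_n += 1e-6 * np.trace(phi_n).real / channels * np.eye(channels)
        # Steering vector: principal eigenvector of the speech covariance.
        _, vecs = np.linalg.eigh(phi_s)
        d = vecs[:, -1]
        # Beamform, then apply a masking-based post-filter to suppress
        # residual noise in the beamformer output.
        beamformed = mvdr_weights(phi_n, d).conj() @ Y    # (frames,)
        out[:, f] = speech_mask[:, f] * beamformed
    return out
```

To mimic the iteration described in the abstract, one would re-run `estimate_mask` on the enhanced output, refresh the mask and covariance estimates, and repeat for a small fixed number of iterations before taking the final enhanced speech.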
