Online LSTM-based Iterative Mask Estimation for Multi-Channel Speech Enhancement and ASR

Accurate steering vector estimation is key for a beamformer that suppresses background noise to improve the quality and intelligibility of noisy speech. Recently, time-frequency masking approaches, which estimate the steering vectors used by the beamformer, have become popular in this field. In particular, we proposed an iterative mask estimation (IME) approach to improve complex Gaussian mixture model (CGMM) based beamforming, which yielded the best system for multi-channel ASR in the CHiME-4 challenge [1]. In [2], we further demonstrated that the algorithm improves speech quality (PESQ) and intelligibility (STOI) for multi-channel speech enhancement. In this study, we focus on an online version of the IME algorithm for multi-channel speech enhancement and ASR that achieves performance comparable to the offline version. In addition, a regression long short-term memory recurrent neural network (LSTM-RNN) trained with multiple-target joint learning, denoted LSTM-MT, is used to replace the two separate models in [2]. Experiments on the CHiME-4 simulation data show that the online IME algorithm improves enhancement performance, e.g., PESQ from 2.18 to 2.58 and STOI from 86.85 to 94.51, comparable to the results obtained by offline IME. Furthermore, the LSTM-MT based post-processing achieves an additional PESQ improvement from 2.58 to 2.71. Experiments on the CHiME-4 real data show that the online IME approach outperforms the online CGMM-based approach, with a relative word error rate (WER) reduction of 14.49%.
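The abstract gives no implementation details, but the mask-driven MVDR beamforming at the core of such systems can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the frame-recursive covariance update with forgetting factor `alpha`, the per-frequency principal-eigenvector steering estimate, and all names (`online_mask_mvdr`, `eps`) are assumptions for exposition. In the paper's pipeline, the speech and noise masks would come from the LSTM and, in the IME loop, be iteratively refined together with the CGMM-based estimates.

```python
import numpy as np

def online_mask_mvdr(Y, speech_mask, noise_mask, alpha=0.95, eps=1e-6):
    """Frame-recursive, mask-based MVDR beamforming (illustrative sketch).

    Y           : (F, T, C) complex multichannel STFT of the noisy input
    speech_mask : (F, T) mask in [0, 1], e.g. predicted by an LSTM
    noise_mask  : (F, T) mask in [0, 1]
    alpha       : forgetting factor for the online covariance updates (assumed)
    Returns     : (F, T) complex STFT of the beamformed output
    """
    F, T, C = Y.shape
    load = eps * np.eye(C)
    # Running spatial covariance estimates, one (C, C) matrix per frequency bin.
    phi_s = np.tile(load, (F, 1, 1)).astype(complex)
    phi_n = np.tile(load, (F, 1, 1)).astype(complex)
    out = np.zeros((F, T), dtype=complex)

    for t in range(T):
        y = Y[:, t, :]                                # (F, C) current frame
        outer = y[:, :, None] * y[:, None, :].conj()  # (F, C, C) rank-1 updates
        # Mask-weighted recursive updates: this is what makes the method online.
        phi_s = alpha * phi_s + (1 - alpha) * speech_mask[:, t, None, None] * outer
        phi_n = alpha * phi_n + (1 - alpha) * noise_mask[:, t, None, None] * outer
        for f in range(F):
            # Steering vector: principal eigenvector of the speech covariance.
            _, vecs = np.linalg.eigh(phi_s[f])
            d = vecs[:, -1]
            # MVDR weights: w = Phi_n^{-1} d / (d^H Phi_n^{-1} d),
            # with diagonal loading for numerical stability.
            num = np.linalg.solve(phi_n[f] + load, d)
            w = num / (d.conj() @ num)
            out[f, t] = w.conj() @ y[f]
    return out
```

Because the covariances are updated frame by frame with exponential forgetting rather than accumulated over the whole utterance, the beamformer output at frame t depends only on past frames, which is the property that lets the online variant approach the offline performance reported above.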

[1] Tara N. Sainath, et al. Deep Neural Networks for Acoustic Modeling in Speech Recognition, 2012.

[2] Jon Barker, et al. The third ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[3] Hermann Ney, et al. Improved backing-off for M-gram language modeling, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[4] Emmanuel Vincent, et al. Multichannel Audio Source Separation With Deep Neural Networks, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016.

[5] Reinhold Häb-Umbach, et al. BLSTM supported GEV beamformer front-end for the 3rd CHiME challenge, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[6] Tomohiro Nakatani, et al. Online MVDR Beamformer Based on Complex Gaussian Mixture Model With Spatial Prior for Noise Robust ASR, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017.

[7] Björn W. Schuller, et al. Discriminatively trained recurrent neural networks for single-channel speech separation, 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP).

[8] Jun Du, et al. Speech separation based on signal-noise-dependent deep neural networks for robust speech recognition, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[9] Tara N. Sainath, et al. Multichannel Signal Processing With Deep Neural Networks for Automatic Speech Recognition, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017.

[10] Rémi Gribonval, et al. Oracle estimators for the benchmarking of source separation algorithms, Signal Processing, 2007.

[11] Jon Barker, et al. An analysis of environment, microphone and data simulation mismatches in robust speech recognition, Computer Speech & Language, 2017.

[12] Takuya Yoshioka, et al. Robust MVDR beamforming using time-frequency masks for online/offline ASR in noise, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[13] Walter Kellermann, et al. A generalization of blind source separation algorithms for convolutive mixtures based on second-order statistics, IEEE Transactions on Speech and Audio Processing, 2005.

[14] Mark J. F. Gales, et al. Maximum likelihood linear transformations for HMM-based speech recognition, Computer Speech & Language, 1998.

[15] Jun Du, et al. LSTM-based iterative mask estimation and post-processing for multi-channel speech enhancement, 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC).

[16] Jun Du, et al. Joint training of front-end and back-end deep neural networks for robust speech recognition, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[17] Jun Du, et al. On Design of Robust Deep Models for CHiME-4 Multi-Channel Speech Recognition with Multiple Configurations of Array Microphones, INTERSPEECH, 2017.

[18] Reinhold Häb-Umbach, et al. Speech Enhancement With a GSC-Like Structure Employing Eigenvector-Based Transfer Function Ratios Estimation, IEEE Transactions on Audio, Speech, and Language Processing, 2011.

[19] Marc Moonen, et al. Performance Analysis of Multichannel Wiener Filter-Based Noise Reduction in Hearing Aids Under Second Order Statistics Estimation Errors, IEEE Transactions on Audio, Speech, and Language Processing, 2011.

[20] Geoffrey E. Hinton, et al. Acoustic Modeling Using Deep Belief Networks, IEEE Transactions on Audio, Speech, and Language Processing, 2012.

[21] Björn W. Schuller, et al. Speech Enhancement with LSTM Recurrent Neural Networks and its Application to Noise-Robust ASR, LVA/ICA, 2015.

[22] B. D. Van Veen, et al. Beamforming: a versatile approach to spatial filtering, IEEE ASSP Magazine, 1988.

[23] Jun Du, et al. Joint training of DNNs by incorporating an explicit dereverberation structure for distant speech recognition, EURASIP Journal on Advances in Signal Processing, 2016.

[24] Lukáš Burget, et al. Recurrent neural network based language model, INTERSPEECH, 2010.

[25] John R. Hershey, et al. Unified Architecture for Multichannel End-to-End Speech Recognition With Neural Beamforming, IEEE Journal of Selected Topics in Signal Processing, 2017.

[26] Wojciech Zaremba, et al. Recurrent Neural Network Regularization, arXiv, 2014.