Deep long short-term memory adaptive beamforming networks for multichannel robust speech recognition

Far-field speech recognition in noisy and reverberant conditions remains a challenging problem despite recent deep learning breakthroughs. This problem is commonly addressed by acquiring a speech signal from multiple microphones and performing beamforming over them. In this paper, we propose to use a recurrent neural network with long short-term memory (LSTM) architecture to adaptively estimate real-time beamforming filter coefficients to cope with non-stationary environmental noise and dynamic nature of source and microphones positions which results in a set of timevarying room impulse responses. The LSTM adaptive beamformer is jointly trained with a deep LSTM acoustic model to predict senone labels. Further, we use hidden units in the deep LSTM acoustic model to assist in predicting the beamforming filter coefficients. The proposed system achieves 7.97% absolute gain over baseline systems with no beamforming on CHiME-3 real evaluation set.

[1]  Jon Barker,et al.  The third ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[2]  Yoshua Bengio,et al.  Attention-Based Models for Speech Recognition , 2015, NIPS.

[3]  Tara N. Sainath,et al.  Factored spatial and spectral multichannel raw waveform CLDNNs , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[4]  Daniel Povey,et al.  Revisiting Recurrent Neural Networks for robust ASR , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5]  Dong Yu,et al.  Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[6]  Boaz Rafaely,et al.  Microphone Array Signal Processing , 2008 .

[7]  Björn W. Schuller,et al.  Memory-Enhanced Neural Networks and NMF for Robust ASR , 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[8]  Biing-Hwang Juang,et al.  Recurrent deep neural networks for robust speech recognition , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[9]  Tara N. Sainath,et al.  Neural Network Adaptive Beamforming for Robust Multichannel Speech Recognition , 2016, INTERSPEECH.

[10]  Liang Lu,et al.  Deep beamforming networks for multi-channel speech recognition , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11]  Jonathan Le Roux,et al.  Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12]  Xavier Anguera Miró,et al.  Acoustic Beamforming for Speaker Diarization of Meetings , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[13]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[14]  Jonathan Le Roux,et al.  Improved MVDR Beamforming Using Single-Channel Mask Prediction Networks , 2016, INTERSPEECH.

[15]  Navdeep Jaitly,et al.  Hybrid speech recognition with Deep Bidirectional LSTM , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[16]  B.D. Van Veen,et al.  Beamforming: a versatile approach to spatial filtering , 1988, IEEE ASSP Magazine.

[17]  Andrew W. Senior,et al.  Long short-term memory recurrent neural network architectures for large scale acoustic modeling , 2014, INTERSPEECH.

[18]  Jonathan Le Roux,et al.  The MERL/SRI system for the 3RD CHiME challenge using beamforming, robust feature extraction, and advanced speech recognition , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[19]  Reinhold Häb-Umbach,et al.  Blind Acoustic Beamforming Based on Generalized Eigenvalue Decomposition , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[20]  Masakiyo Fujimoto,et al.  Strategies for distant speech recognitionin reverberant environments , 2015, EURASIP J. Adv. Signal Process..

[21]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[22]  Kenta Oono,et al.  Chainer : a Next-Generation Open Source Framework for Deep Learning , 2015 .

[23]  Björn W. Schuller,et al.  Feature enhancement by deep LSTM networks for ASR in reverberant multisource environments , 2014, Comput. Speech Lang..

[24]  Hermann Ney,et al.  LSTM Neural Networks for Language Modeling , 2012, INTERSPEECH.

[25]  Tara N. Sainath,et al.  FUNDAMENTAL TECHNOLOGIES IN MODERN SPEECH RECOGNITION Digital Object Identifier 10.1109/MSP.2012.2205597 , 2012 .

[26]  Geoffrey E. Hinton,et al.  Speech recognition with deep recurrent neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[27]  Tara N. Sainath,et al.  Speaker location and microphone spacing invariant acoustic modeling from raw multichannel waveforms , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).