Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks

Long Short-Term Memory (LSTM) recurrent neural network has proven effective in modeling speech and has achieved outstanding performance in both speech enhancement (SE) and automatic speech recognition (ASR). To further improve the performance of noise-robust speech recognition, a combination of speech enhancement and recognition was shown to be promising in earlier work. This paper aims to explore options for consistent integration of SE and ASR using LSTM networks. Since SE and ASR have different objective criteria, it is not clear what kind of integration would finally lead to the best word error rate for noise-robust ASR tasks. In this work, several integration architectures are proposed and tested, including: (1) a pipeline architecture of LSTM-based SE and ASR with sequence training, (2) an alternating estimation architecture, and (3) a multi-task hybrid LSTM network architecture. The proposed models were evaluated on the 2nd CHiME speech separation and recognition challenge task, and show significant improvements relative to prior results.

[1]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[2]  Jun Du,et al.  Robust speech recognition with speech enhanced deep neural networks , 2014, INTERSPEECH.

[3]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[4]  Yongqiang Wang,et al.  An investigation of deep neural networks for noise robust speech recognition , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[5]  Yuuki Tachioka,et al.  The MERL/MELCO/TUM system for the REVERB Challenge using Deep Recurrent Neural Network Feature Enhancement , 2014, ICASSP 2014.

[6]  Jon Barker,et al.  The second ‘chime’ speech separation and recognition challenge: Datasets, tasks and baselines , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[7]  Björn W. Schuller,et al.  Discriminatively trained recurrent neural networks for single-channel speech separation , 2014, 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP).

[8]  Georg Heigold,et al.  Sequence discriminative distributed training of long short-term memory recurrent neural networks , 2014, INTERSPEECH.

[9]  Tara N. Sainath,et al.  Scalable Minimum Bayes Risk Training of Deep Neural Network Acoustic Models Using Distributed Hessian-free Optimization , 2012, INTERSPEECH.

[10]  Biing-Hwang Juang,et al.  Recurrent deep neural networks for robust speech recognition , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11]  Jun Du,et al.  Speech separation based on improved deep neural networks with dual outputs of speech features for both target and interfering speakers , 2014, The 9th International Symposium on Chinese Spoken Language Processing.

[12]  Tara N. Sainath,et al.  Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups , 2012, IEEE Signal Processing Magazine.

[13]  Andrew W. Senior,et al.  Long Short-Term Memory Based Recurrent Neural Network Architectures for Large Vocabulary Speech Recognition , 2014, ArXiv.

[14]  Lukás Burget,et al.  Sequence-discriminative training of deep neural networks , 2013, INTERSPEECH.

[15]  Jürgen Schmidhuber,et al.  Framewise phoneme classification with bidirectional LSTM and other neural network architectures , 2005, Neural Networks.

[16]  Hermann Ney,et al.  Investigations on error minimizing training criteria for discriminative training in automatic speech recognition , 2005, INTERSPEECH.

[17]  Björn W. Schuller,et al.  Robust speech recognition using long short-term memory recurrent neural networks for hybrid acoustic modelling , 2014, INTERSPEECH.

[18]  Björn W. Schuller,et al.  Memory-Enhanced Neural Networks and NMF for Robust ASR , 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[19]  Yuuki Tachioka,et al.  DISCRIMINATIVE METHODS FOR NOISE ROBUST SPEECH RECOGNITION: A CHIME CHALLENGE BENCHMARK , 2013 .