Highway long short-term memory RNNs for distant speech recognition

In this paper, we extend deep long short-term memory (DLSTM) recurrent neural networks by introducing gated direct connections between memory cells in adjacent layers. These direct links, called highway connections, enable unimpeded information flow across layers and thus alleviate the vanishing-gradient problem when building deeper LSTMs. We further introduce latency-controlled bidirectional LSTMs (BLSTMs), which can exploit the whole history while keeping the latency under control. Efficient algorithms are proposed to train these novel networks using both frame and sequence discriminative criteria. Experiments on the AMI distant speech recognition (DSR) task indicate that we can train deeper LSTMs and obtain larger gains from sequence training with highway LSTMs (HLSTMs). Our novel model obtains 43.9/47.7% WER on the AMI (SDM) dev and eval sets, outperforming all previous work. It beats the strong DNN and DLSTM baselines by 15.7% and 5.3% relative improvement, respectively.
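To make the highway idea concrete, the sketch below shows one time step of a highway LSTM cell in NumPy. It is an illustrative reimplementation under assumptions, not the paper's exact parameterization: the gate weights, biases, and the parameter-dictionary keys (`Wi`, `bd`, etc.) are hypothetical names. The key difference from a standard LSTM cell is the extra depth ("highway") gate `d`, which carries the lower layer's memory cell directly into this layer's cell state.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hlstm_cell_step(x, h_prev, c_prev, c_lower, p):
    """One time step of a highway LSTM cell (illustrative sketch).

    x       -- input vector at this time step
    h_prev  -- this layer's hidden state from the previous time step
    c_prev  -- this layer's cell state from the previous time step
    c_lower -- the lower layer's cell state at the current time step
    p       -- dict of (hypothetical) weight matrices and bias vectors
    """
    z = np.concatenate([x, h_prev])
    i = sigmoid(p["Wi"] @ z + p["bi"])   # input gate
    f = sigmoid(p["Wf"] @ z + p["bf"])   # forget gate
    o = sigmoid(p["Wo"] @ z + p["bo"])   # output gate
    g = np.tanh(p["Wg"] @ z + p["bg"])   # candidate cell update
    d = sigmoid(p["Wd"] @ z + p["bd"])   # depth (highway) gate
    # The highway term d * c_lower gives cell-state information a gated
    # direct path up from the layer below, easing gradient flow in deep
    # stacks; with d -> 0 this reduces to a standard LSTM cell.
    c = f * c_prev + i * g + d * c_lower
    h = o * np.tanh(c)
    return h, c
```

Stacking such cells yields a DLSTM in which each layer's cell state can bypass the nonlinear transformations of the layers below, which is what allows deeper models to keep training effectively.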
