论文信息 - Unfolded recurrent neural networks for speech recognition

Unfolded recurrent neural networks for speech recognition

We introduce recurrent neural networks (RNNs) for acoustic modeling which are unfolded in time for a fixed number of time steps. The proposed models are feedforward networks with the property that the unfolded layers which correspond to the recurrent layer have time-shifted inputs and tied weight matrices. Besides the temporal depth due to unfolding, hierarchical processing depth is added by means of several non-recurrent hidden layers inserted between the unfolded layers and the output layer. The training of these models: (a) has a complexity that is comparable to deep neural networks (DNNs) with the same number of layers; (b) can be done on frame-randomized minibatches; (c) can be implemented efficiently through matrix-matrix operations on GPU architectures which makes it scalable for large tasks. Experimental results on the Switchboard 300 hours English conversational telephony task show a 5% relative improvement in word error rate over state-of-the-art DNNs trained on FMLLR features with i-vector speaker adaptation and hessianfree sequence discriminative training. Index Terms: recurrent neural networks, speech recognition

[1] Benjamin Schrauwen,et al. Training and Analysing Deep Recurrent Neural Networks , 2013, NIPS.

[2] Biing-Hwang Juang,et al. Recurrent deep neural networks for robust speech recognition , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3] Jürgen Schmidhuber,et al. Learning Complex, Extended Sequences Using the Principle of History Compression , 1992, Neural Computation.

[4] Andrew W. Senior,et al. Long Short-Term Memory Based Recurrent Neural Network Architectures for Large Vocabulary Speech Recognition , 2014, ArXiv.

[5] George Saon,et al. Speaker adaptation of neural network acoustic models using i-vectors , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[6] Ebru Arisoy,et al. Low-rank matrix factorization for Deep Neural Network training with high-dimensional output targets , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[7] Yoshua Bengio,et al. Convolutional networks for images, speech, and time series , 1998 .

[8] P J Webros. BACKPROPAGATION THROUGH TIME: WHAT IT DOES AND HOW TO DO IT , 1990 .

[9] Geoffrey E. Hinton,et al. Phoneme recognition using time-delay neural networks , 1989, IEEE Trans. Acoust. Speech Signal Process..

[10] Razvan Pascanu,et al. How to Construct Deep Recurrent Neural Networks , 2013, ICLR.

[11] Lukás Burget,et al. Recurrent neural network based language model , 2010, INTERSPEECH.

[12] Dong Yu,et al. Feature engineering in Context-Dependent Deep Neural Networks for conversational speech transcription , 2011, 2011 IEEE Workshop on Automatic Speech Recognition & Understanding.

[13] Geoffrey E. Hinton,et al. Speech recognition with deep recurrent neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[14] Kuldip K. Paliwal,et al. Bidirectional recurrent neural networks , 1997, IEEE Trans. Signal Process..

[15] Anthony J. Robinson,et al. An application of recurrent nets to phone probability estimation , 1994, IEEE Trans. Neural Networks.

[16] George Saon,et al. Neural network acoustic models for the DARPA RATS program , 2013, INTERSPEECH.

[17] Navdeep Jaitly,et al. Hybrid speech recognition with Deep Bidirectional LSTM , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[18] Tara N. Sainath,et al. Scalable Minimum Bayes Risk Training of Deep Neural Network Acoustic Models Using Distributed Hessian-free Optimization , 2012, INTERSPEECH.

[19] Daniel Povey,et al. Revisiting Recurrent Neural Networks for robust ASR , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).