Large Vocabulary Speech Recognition Using Deep Tensor Neural Networks

Recently, we proposed and developed the context-dependent deep neural network hidden Markov models (CD-DNN-HMMs) for large vocabulary speech recognition and achieved highly promising recognition results including over one third fewer word errors than the discriminatively trained, conventional HMM-based systems on the 300hr Switchboard benchmark task. In this paper, we extend DNNs to deep tensor neural networks (DTNNs) in which one or more layers are double-projection and tensor layers. The basic idea of the DTNN comes from our realization that many factors interact with each other to predict the output. To represent these interactions, we project the input to two nonlinear subspaces through the double-projection layer and model the interactions between these two subspaces and the output neurons through a tensor with three-way connections. Evaluation on 30hr Switchboard task indicates that DTNNs can outperform DNNs with similar number of parameters with 5% relative word error reduction.

[1]  Dong Yu,et al.  Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[2]  Dong Yu,et al.  Conversational Speech Transcription Using Context-Dependent Deep Neural Networks , 2012, ICML.

[3]  Hervé Bourlard,et al.  Connectionist probability estimators in HMM speech recognition , 1994, IEEE Trans. Speech Audio Process..

[4]  Dong Yu,et al.  Exploiting sparseness in deep neural networks for large vocabulary speech recognition , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5]  Dong Yu,et al.  FACTORIZED DEEP NEURAL NETWORKS FOR ADAPTIVE SPEECH RECOGNITION , 2012 .

[6]  Geoffrey Zweig,et al.  Automated directory assistance system - from theory to practice , 2007, INTERSPEECH.

[7]  Dong Yu,et al.  Roles of Pre-Training and Fine-Tuning in Context-Dependent DBN-HMMs for Real-World Speech Recognition , 2010 .

[8]  Dong Yu,et al.  Investigation of full-sequence training of deep belief networks for speech recognition , 2010, INTERSPEECH.

[9]  Geoffrey E. Hinton,et al.  Gated Softmax Classification , 2010, NIPS.

[10]  Dong Yu,et al.  A deep architecture with bilinear modeling of hidden representations: Applications to phonetic recognition , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11]  Geoffrey E. Hinton,et al.  Acoustic Modeling Using Deep Belief Networks , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[12]  Dong Yu,et al.  Pipelined Back-Propagation for Context-Dependent Deep Neural Networks , 2012, INTERSPEECH.

[13]  Dong Yu,et al.  Feature engineering in Context-Dependent Deep Neural Networks for conversational speech transcription , 2011, 2011 IEEE Workshop on Automatic Speech Recognition & Understanding.