Investigation of full-sequence training of deep belief networks for speech recognition

Recently, Deep Belief Networks (DBNs) have been proposed for phone recognition and have been found to achieve highly competitive performance. In the original DBNs, only frame-level information was used to train the DBN weights, although it has long been known that sequential, or full-sequence, information can improve speech recognition accuracy. In this paper we investigate approaches to optimizing the DBN weights, state-to-state transition parameters, and language model scores using a sequential discriminative training criterion. We describe and analyze the proposed training algorithm and strategy, and discuss practical issues and how they affect the final results. We show that DBNs trained with the sequence-based criterion outperform those trained with the frame-based criterion for both three-layer and six-layer models, although the optimization procedure for the deeper DBN is more difficult under the former criterion.
