Continuous Speech Recognition with the Connectionist Viterbi Training Procedure: A Summary of Recent Work

Hybrid methods that combine hidden Markov models (HMMs) with connectionist techniques take advantage of what are believed to be the strong points of each approach: the powerful discrimination-based learning of connectionist networks and the time-alignment capability of HMMs. Connectionist Viterbi Training (CVT) is a simple variation of Viterbi training that uses a back-propagation network to represent the output distributions associated with the transitions in the HMM. The work reported here is the culmination of three years of investigation into ways of combining HMMs and neural networks (NNs) for continuous speech recognition. This paper describes the CVT procedure, discusses the factors most important to its design, and reports its recognition performance. Several changes made to the system over the past year are reported, including: (1) the change from recurrent to non-recurrent NNs, (2) the change from Sphinx-style phone-based HMMs to word-based HMMs, (3) the addition of a corrective training procedure, and (4) the addition of an alternate model for every word. The CVT system, incorporating these changes, achieves 99.1% word accuracy and 98.0% string accuracy on the TI/NBS Connected Digits task (“TI Digits”).
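
The time-alignment step that Viterbi training relies on can be illustrated with a minimal sketch. It is not the CVT system itself: it assumes a strictly left-to-right model in which each frame either stays in the current state or advances by one, and it takes a hypothetical table `log_scores[t][s]` of per-frame log scores that stands in for the outputs of a trained network. In Viterbi training generally, the resulting best path is then used to re-estimate the model; how CVT does this in detail is not specified in this summary.

```python
import math

def viterbi_align(log_scores):
    """Best left-to-right state path for a sequence of frames.

    log_scores[t][s] is a hypothetical log score of frame t under state s
    (an assumption standing in for network outputs). The path must start
    in state 0, end in the last state, and may only stay or advance by one.
    """
    T, S = len(log_scores), len(log_scores[0])
    if T < S:
        raise ValueError("need at least as many frames as states")
    NEG = -math.inf
    best = [[NEG] * S for _ in range(T)]   # best path score ending at (t, s)
    back = [[0] * S for _ in range(T)]     # predecessor state for (t, s)
    best[0][0] = log_scores[0][0]
    for t in range(1, T):
        for s in range(S):
            stay = best[t - 1][s]
            move = best[t - 1][s - 1] if s > 0 else NEG
            if move > stay:
                best[t][s] = move + log_scores[t][s]
                back[t][s] = s - 1
            else:
                best[t][s] = stay + log_scores[t][s]
                back[t][s] = s
    # Trace back from the final state at the last frame.
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

# Four frames, two states: the first two frames favour state 0.
scores = [[0.0, -5.0], [-1.0, -4.0], [-6.0, -0.5], [-7.0, -0.2]]
alignment = viterbi_align(scores)
```

Transition probabilities are omitted here for brevity; a fuller sketch would add a log transition score to the `stay` and `move` terms.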
