Optimal tying of HMM mixture densities using decision trees

The most detailed acoustic models in our two-pass, speaker-independent continuous speech recognition system are context-dependent models, which become harder to train adequately as the number of distinct contexts grows. Tying model parameters, or clustering model densities with bottom-up agglomerative procedures, can efficiently reduce the number of parameters to train, but leaves open the problem of how to model contexts unseen in training. Top-down clustering with a decision tree can provide well-trained models for any context, whether or not it was seen in training. Trees are built from a root node that is successively split by selecting, among questions about phonetic context, the one that best separates the data. Several goodness-of-split criteria have been proposed, such as Poisson-based (Bahl et al., 1991) or single-Gaussian-based (Bahl et al., 1994) criteria, the choice among them being motivated primarily by computational considerations. We show, from maximum likelihood considerations, how to derive a computationally efficient criterion based on a different approximation that uses tied mixtures of Gaussian densities.
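As an illustration of the split-selection step described above, the following is a minimal sketch of one node split using the simpler single-Gaussian likelihood-gain criterion, not the tied-mixture criterion derived in this paper. The function names (`gaussian_loglik`, `best_split`), the `VAR_FLOOR` constant, the sample layout, and the two example questions are all hypothetical and chosen only for this sketch.

```python
import math

VAR_FLOOR = 1e-6  # hypothetical variance floor to avoid log(0)


def gaussian_loglik(frames):
    """Log-likelihood of frames under their own ML diagonal-covariance Gaussian.

    For an ML fit, each dimension contributes -0.5 * n * (log(2*pi*var) + 1).
    The closed form is exact only when the variance floor is not applied.
    """
    n = len(frames)
    dim = len(frames[0])
    total = 0.0
    for d in range(dim):
        values = [f[d] for f in frames]
        mean = sum(values) / n
        var = max(sum((v - mean) ** 2 for v in values) / n, VAR_FLOOR)
        total += math.log(2.0 * math.pi * var) + 1.0
    return -0.5 * n * total


def best_split(samples, questions, min_count=2):
    """Return (question name, gain) for the question with the largest likelihood gain.

    samples:   list of (context, frame) pairs; context is a dict, frame a list of floats
    questions: dict mapping a question name to a yes/no predicate on the context
    """
    parent = gaussian_loglik([x for _, x in samples])
    best_name, best_gain = None, 0.0
    for name, ask in questions.items():
        yes = [x for ctx, x in samples if ask(ctx)]
        no = [x for ctx, x in samples if not ask(ctx)]
        if len(yes) < min_count or len(no) < min_count:
            continue  # skip splits that leave a child with too little data
        gain = gaussian_loglik(yes) + gaussian_loglik(no) - parent
        if gain > best_gain:
            best_name, best_gain = name, gain
    return best_name, best_gain


# Toy data (made up): frames cluster by left context, not by right context.
samples = [
    ({"left": "m", "right": "a"}, [0.0]),
    ({"left": "n", "right": "i"}, [0.2]),
    ({"left": "m", "right": "i"}, [-0.2]),
    ({"left": "t", "right": "a"}, [5.0]),
    ({"left": "k", "right": "i"}, [5.2]),
    ({"left": "t", "right": "i"}, [4.8]),
]
questions = {
    "L_nasal": lambda ctx: ctx["left"] in {"m", "n"},
    "R_is_a": lambda ctx: ctx["right"] == "a",
}
chosen, gain = best_split(samples, questions)
```

Applying `best_split` recursively to the "yes" and "no" subsets, until the gain or the data count falls below a threshold, would grow the tree. Because every context answers each question, an unseen context still reaches some leaf, which is what lets a tree supply a model for contexts absent from training.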

[1] Vassilios Digalakis, et al. Genones: optimizing the degree of mixture tying in a large vocabulary hidden Markov model based speech recognizer, 1994, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing.

[2] Kai-Fu Lee, et al. Automatic Speech Recognition, 1989.

[3] Michael Picheny, et al. Context Dependent Modeling of Phones in Continuous Speech Using Decision Trees, 1991, HLT.

[4] Michael Picheny, et al. Robust methods for using context-dependent features and models in a continuous speech recognizer, 1994, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing.

[5] Paul Labute, et al. Bi-directional graph search strategies for speech recognition, 1996, Comput. Speech Lang.

[6] P.C. Woodland, et al. The 1994 HTK large vocabulary speech recognition system, 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[7] Douglas D. O'Shaughnessy, et al. Experiments in continuous speech recognition using books on tape, 1994, Speech Commun.

[8] Roland Kuhn, et al. Improved decision trees for phonetic modeling, 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[9] Mei-Yuh Hwang, et al. Senones, multi-pass search, and unified stochastic modeling in sphinx-II, 1993, EUROSPEECH.

[10] George Zavaliagkos, et al. Comparative Experiments on Large Vocabulary Speech Recognition, 1993, HLT.

[11] S. J. Young, et al. Tree-based state tying for high accuracy acoustic modelling, 1994.