Best-first Model Merging for Hidden Markov Model Induction

This report describes a new technique for inducing the structure of Hidden Markov Models (HMMs) from data, based on the general `model merging' strategy (Omohundro 1992). The process begins with a maximum-likelihood HMM that directly encodes the training data. Successively more general models are produced by merging HMM states. A Bayesian posterior probability criterion determines which states to merge and when to stop generalizing, so the procedure can be viewed as a heuristic search for the HMM structure with the highest posterior probability. We discuss a variety of possible priors for HMMs, as well as a number of approximations that improve the computational efficiency of the algorithm. We evaluate the procedure in three applications. The first compares the merging algorithm with the standard Baum-Welch approach on the task of inducing simple finite-state languages from small, positive-only training samples; we find that the merging procedure is more robust and accurate, particularly when training data are scarce. The second application uses labelled speech data from the TIMIT database to build compact, multiple-pronunciation word models for use in speech recognition. Finally, we describe how the algorithm was incorporated into an operational speech understanding system, where it is combined with neural-network acoustic likelihood estimators to improve performance over single-pronunciation word models.
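To make the merging loop above concrete, here is a minimal, self-contained sketch in Python. It builds the initial maximum-likelihood HMM by giving each training string its own chain of states, then greedily applies the best-scoring state merge until no merge improves the score. The scoring function is only a stand-in (training-data log likelihood minus a simple size penalty) for the Bayesian posterior used in the report, and all names (`initial_hmm`, `best_first_merge`, `prior_weight`) are illustrative rather than taken from the report.

```python
# Illustrative sketch of best-first HMM state merging; the scoring function and
# data structures are simplifying assumptions, not the report's actual method.
import math
from collections import defaultdict

def initial_hmm(samples):
    """Build a maximum-likelihood HMM that encodes each sample on its own path."""
    transitions = defaultdict(lambda: defaultdict(int))  # state -> next state -> count
    emissions = {}                                       # state -> emitted symbol
    next_state = 1                                       # state 0 is the start state
    for sample in samples:
        prev = 0
        for symbol in sample:
            emissions[next_state] = symbol
            transitions[prev][next_state] += 1
            prev = next_state
            next_state += 1
        transitions[prev][-1] += 1                       # -1 marks the end state
    return transitions, emissions

def log_likelihood(transitions):
    """Log likelihood of the training counts under ML transition probabilities."""
    ll = 0.0
    for succs in transitions.values():
        total = sum(succs.values())
        for count in succs.values():
            ll += count * math.log(count / total)
    return ll

def score(transitions, emissions, prior_weight=1.0):
    """Crude posterior proxy: likelihood minus a size penalty standing in for the prior."""
    return log_likelihood(transitions) - prior_weight * len(emissions)

def merge(transitions, emissions, keep, drop):
    """Return a new model with state `drop` merged into state `keep`."""
    new_trans = defaultdict(lambda: defaultdict(int))
    for src, succs in transitions.items():
        for dst, count in succs.items():
            s = keep if src == drop else src
            d = keep if dst == drop else dst
            new_trans[s][d] += count
    new_emis = {s: sym for s, sym in emissions.items() if s != drop}
    return new_trans, new_emis

def best_first_merge(samples, prior_weight=1.0):
    transitions, emissions = initial_hmm(samples)
    current = score(transitions, emissions, prior_weight)
    while True:
        best = None
        states = list(emissions)
        # Only states emitting the same symbol are merge candidates here (an
        # assumption that mirrors discrete-output HMMs with one symbol per state).
        for i, a in enumerate(states):
            for b in states[i + 1:]:
                if emissions[a] != emissions[b]:
                    continue
                cand = merge(transitions, emissions, a, b)
                s = score(*cand, prior_weight)
                if best is None or s > best[0]:
                    best = (s, cand)
        if best is None or best[0] <= current:
            break                                        # stop when no merge helps
        current, (transitions, emissions) = best[0], best[1]
    return transitions, emissions

if __name__ == "__main__":
    model = best_first_merge(["ab", "abab", "ababab"], prior_weight=0.5)
    print(len(model[1]), "emitting states after merging")
```

Restricting merge candidates to states that emit the same symbol keeps the sketch short; the full algorithm described in the report also merges emission distributions and trades the loss in data likelihood against the gain in prior probability from a smaller model.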

[1] Andrew J. Viterbi et al. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm, 1967, IEEE Trans. Inf. Theory.

[2] A. Reber. Implicit learning of artificial grammars, 1967.

[3] R. J. Nelson et al. Introduction to Automata, 1968.

[4] James Jay Horning. A study of grammatical inference, 1969.

[5] L. Baum et al. A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains, 1970.

[6] D. Rubin et al. Maximum likelihood from incomplete data via the EM algorithm (with discussion), 1977.

[7] Jeffrey D. Ullman et al. Introduction to Automata Theory, Languages and Computation, 1979.

[8] Frederick Jelinek et al. Interpolated estimation of Markov source parameters from sparse data, 1980.

[9] J. Rissanen. A universal prior for integers and estimation by minimum description length, 1983.

[10] Carl H. Smith et al. Inductive Inference: Theory and Methods, 1983, CSUR.

[11] R. Redner et al. Mixture densities, maximum likelihood, and the EM algorithm, 1984.

[12] Michael G. Thomason et al. Dynamic Programming Inference of Markov Networks from Finite Sets of Sample Strings, 1986, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[13] C. S. Wallace et al. Estimation and Inference by Compact Coding, 1987.

[14] Slava M. Katz et al. Estimation of probabilities from sparse data for the language model component of a speech recognizer, 1987, IEEE Trans. Acoust. Speech Signal Process.

[15] Jerome A. Feldman et al. Learning automata from ordered examples, 1991, COLT '88.

[16] S. Gull. Bayesian Inductive Inference and Maximum Entropy, 1988.

[17] James Kelly et al. AutoClass: A Bayesian Classification System, 1993, ML.

[18] Ronald L. Rivest et al. Inferring Decision Trees Using the Minimum Description Length Principle, 1989, Inf. Comput.

[19] Francine R. Chen. Identification of contextual factors for pronunciation networks, 1990, International Conference on Acoustics, Speech, and Signal Processing.

[20] Steve Young et al. Applications of stochastic context-free grammars using the Inside-Outside algorithm, 1990.

[21] Wray L. Buntine. Theory Refinement on Bayesian Networks, 1991, UAI.

[22] Stephen M. Omohundro et al. Best-First Model Merging for Dynamic Learning and Recognition, 1991, NIPS.

[23] Michael Riley et al. A statistical model for generating pronunciation networks, 1991, International Conference on Acoustics, Speech, and Signal Processing.

[24] Thomas M. Cover et al. Elements of Information Theory, 1991.

[25] Axel Cleeremans. Mechanisms of implicit learning: a parallel distributed processing model of sequence acquisition, 1991.

[26] Wray L. Buntine (RIACS). Theory Refinement on Bayesian Networks, 1991.

[27] Chin-Hui Lee et al. Bayesian Learning of Gaussian Mixture Densities for Hidden Markov Models, 1991, HLT.

[28] Fernando Pereira et al. Inside-Outside Reestimation From Partially Bracketed Corpora, 1992, HLT.

[29] Robert L. Mercer et al. Class-Based n-gram Models of Natural Language, 1992, CL.

[30] Wray L. Buntine et al. Learning classification trees, 1992.

[31] Penelope Sibun et al. A Practical Part-of-Speech Tagger, 1992, ANLP.

[32] Andreas Stolcke et al. Hidden Markov Model Induction by Bayesian Model Merging, 1992, NIPS.

[33] Pierre Baldi et al. Hidden Markov Models in Molecular Biology: New Algorithms and Applications, 1992, NIPS.

[34] Hervé Bourlard et al. Connectionist speech recognition, 1993.

[35] Dana Ron et al. The Power of Amnesia, 1993, NIPS.

[36] Hervé Bourlard et al. Connectionist Speech Recognition: A Hybrid Approach, 1993.

[37] D. Haussler et al. Protein modeling using hidden Markov models: analysis of globins, 1993, Proceedings of the Twenty-Sixth Hawaii International Conference on System Sciences.

[38] Andreas Stolcke et al. The Berkeley Restaurant Project, 1994, ICSLP.

[39] Andreas Stolcke et al. Multiple-pronunciation lexical modeling in a speaker independent speech understanding system, 1994, ICSLP.