Efficient training algorithms for HMMs using incremental estimation

Typically, parameter estimation for a hidden Markov model (HMM) is performed using an expectation-maximization (EM) algorithm with the maximum-likelihood (ML) criterion. The EM algorithm is an iterative scheme that is well-defined and numerically stable, but convergence may require a large number of iterations. For speech recognition systems trained on large amounts of material, this results in long training times. This paper presents an incremental estimation approach to speed up the training of HMMs without any loss of recognition performance. The algorithm selects a subset of data from the training set, updates the model parameters based on that subset, and then iterates the process until the parameters converge. The advantage of this approach is a substantial increase in the number of EM iterations per training token, which leads to faster training. In order to achieve reliable estimation from a small fraction of the complete data set at each iteration, two training criteria are studied: ML and maximum a posteriori (MAP) estimation. Experimental results show that the incremental algorithms train substantially faster than the conventional (batch) method and suffer no loss of recognition performance. Furthermore, the incremental MAP-based training algorithm improves performance over the batch version.
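As a rough illustration of the subset-based update described above, the following Python/NumPy sketch applies the same idea to a one-dimensional Gaussian mixture, which stands in here for the HMM emission densities; the block partition, the initialisation, and all function and parameter names (incremental_em, n_blocks, and so on) are assumptions made for illustration, not the paper's implementation. Each visit to a block recomputes that block's sufficient statistics, swaps them into a running total, and re-estimates the parameters immediately, so the parameters are refreshed many times per pass over the data rather than once per batch iteration.

# Minimal sketch of incremental EM on a 1-D Gaussian mixture (a stand-in for
# HMM emission densities). All names and settings here are illustrative.
import numpy as np

def e_step(x, weights, means, variances):
    """Posterior component responsibilities for one block of data."""
    log_p = (np.log(weights)
             - 0.5 * np.log(2 * np.pi * variances)
             - 0.5 * (x[:, None] - means) ** 2 / variances)
    log_p -= log_p.max(axis=1, keepdims=True)      # stabilise the exponential
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)

def incremental_em(x, k=2, n_blocks=10, n_sweeps=5, seed=0):
    rng = np.random.default_rng(seed)
    # Crude initialisation: random data points as means, global variance.
    weights = np.full(k, 1.0 / k)
    means = rng.choice(x, size=k, replace=False)
    variances = np.full(k, x.var())

    blocks = np.array_split(rng.permutation(len(x)), n_blocks)
    # Per-block sufficient statistics: counts, sums, sums of squares.
    stats = [np.zeros((3, k)) for _ in range(n_blocks)]
    totals = np.zeros((3, k))

    for _ in range(n_sweeps):
        for b, idx in enumerate(blocks):
            xb = x[idx]
            r = e_step(xb, weights, means, variances)   # E-step on one block only
            new = np.vstack([r.sum(0), r.T @ xb, r.T @ (xb ** 2)])
            totals += new - stats[b]                    # swap in the refreshed stats
            stats[b] = new
            n, s, s2 = totals                           # M-step after every block
            weights = n / n.sum()
            means = s / n
            variances = s2 / n - means ** 2
    return weights, means, variances

# Toy usage: two well-separated Gaussians.
data = np.concatenate([np.random.default_rng(1).normal(0, 1, 500),
                       np.random.default_rng(2).normal(5, 1, 500)])
print(incremental_em(data))

Replacing a block's old statistics in the running totals, rather than simply accumulating new ones, keeps the totals consistent with a single pass over the full data set, which is what allows each small subset to drive a reliable parameter update.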
