Speaker-independent isolated digit recognition using hidden Markov models

A method for speaker-independent isolated digit recognition based on modeling entire words as discrete probabilistic functions of a Markov chain is described. Training is a three-part process comprising conventional methods of linear predictive coding (LPC) and vector quantization of the LPC coefficients, followed by an algorithm for estimating the parameters of a hidden Markov process. Recognition uses the same linear prediction and vector quantization steps prior to maximum likelihood classification based on the Viterbi algorithm. Vector quantization is performed by a K-means algorithm that finds a codebook of 64 prototypical vectors minimizing the distortion measure (Itakura distance) over the training set. After training on a 1000-token set, recognition experiments were conducted on a separate 1000-token test set obtained from the same talkers. In this test a 3.5% error rate was observed, comparable to that measured in an identical test of an LPC/DTW (dynamic time warping) system. The computational demand for recognition under the new system is reduced by a factor of approximately 10 in both time and memory relative to the LPC/DTW system. It is also of interest that the classification errors made by the two systems are virtually disjoint; thus a combination of the two methods could yield error rates near 1%. In describing our experiments we discuss several issues of theoretical importance, namely: 1) alternatives to the Baum-Welch algorithm for model parameter estimation, e.g., Lagrangian techniques; 2) model-combining techniques, based on a bipartite graph-matching algorithm, that improve model stability; 3) methods for treating the finite-training-data problem by modifications to both the Baum-Welch algorithm and the Lagrangian techniques; and 4) the use of non-ergodic Markov chains for isolated word recognition. We note that the experiments reported here are the first in which a direct comparison is made between two conceptually different (i.e., parametric and non-parametric) methods of treating the non-stationarity of speech by implicitly dividing the signal into quasi-stationary intervals.
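
To make the codebook design step concrete, the sketch below implements K-means clustering under the Itakura measure, assuming each frame is summarized by its autocorrelation lags; the function names, array layout, and the use of SciPy's Toeplitz solvers are illustrative assumptions rather than details from the paper. The natural centroid under this measure is the LPC vector re-derived from the cluster-averaged autocorrelation.

```python
import numpy as np
from scipy.linalg import toeplitz, solve_toeplitz

def lpc_from_autocorr(r):
    """Solve the LPC normal equations for lags r[0..p] and return the
    augmented coefficient vector (1, -a1, ..., -ap)."""
    a = solve_toeplitz(r[:-1], r[1:])
    return np.concatenate(([1.0], -a))

def train_codebook(autocorrs, size=64, iters=10, seed=0):
    """K-means codebook design under the Itakura distance.
    autocorrs: (n_frames, p+1) array of frame autocorrelation lags 0..p."""
    autocorrs = np.asarray(autocorrs, dtype=float)
    rng = np.random.default_rng(seed)
    Rs = np.array([toeplitz(r) for r in autocorrs])        # frame autocorrelation matrices
    a_opt = np.array([lpc_from_autocorr(r) for r in autocorrs])
    alphas = np.einsum('fi,fij,fj->f', a_opt, Rs, a_opt)   # minimum residual energies
    code = a_opt[rng.choice(len(a_opt), size, replace=False)]  # initial codewords
    for _ in range(iters):
        # assign each frame to the codeword of least Itakura distortion:
        # d(b; frame) = log(b^T R b / alpha)
        dists = np.log(np.einsum('ki,fij,kj->fk', code, Rs, code) / alphas[:, None])
        labels = dists.argmin(axis=1)
        # centroid update: re-derive the LPC vector from the cluster-averaged lags
        for k in range(size):
            members = autocorrs[labels == k]
            if len(members):
                code[k] = lpc_from_autocorr(members.mean(axis=0))
    return code
```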
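
The parameter estimation step can be illustrated by one scaled forward-backward re-estimation pass for a discrete HMM. This is a minimal sketch of the classical Baum-Welch update, not the paper's exact implementation; the shapes and names are our own conventions, and it assumes every state receives some posterior mass. Per-frame scaling keeps the recursions from underflowing on long observation sequences.

```python
import numpy as np

def baum_welch_step(obs, pi, A, B):
    """One Baum-Welch re-estimation pass for a discrete HMM.
    Shapes: pi (N,), A (N, N), B (N, K); obs holds codebook indices.
    Returns the updated (pi, A, B)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N)); beta = np.zeros((T, N)); c = np.zeros(T)
    # forward pass, scaling each frame to sum to one
    alpha[0] = pi * B[:, obs[0]]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    # backward pass reusing the same scale factors
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t+1]] * beta[t+1])) / c[t+1]
    gamma = alpha * beta                       # state posteriors per frame
    # pairwise transition posteriors xi[t, i, j]
    xi = (alpha[:-1, :, None] * A[None]
          * (B[:, obs[1:]].T * beta[1:])[:, None, :]) / c[1:, None, None]
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for t, o in enumerate(obs):
        new_B[:, o] += gamma[t]
    new_B /= gamma.sum(axis=0)[:, None]
    return new_pi, new_A, new_B
```

Note that zero entries of A stay zero under this update, so a left-to-right transition structure imposed at initialization is preserved across iterations.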
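
Finally, maximum likelihood classification by the Viterbi algorithm reduces to scoring the quantized observation sequence against each word model and taking the maximum. The log-domain sketch below uses illustrative argument names and an (N, K) emission layout of our own choosing; a non-ergodic (left-to-right) chain is encoded simply by placing -inf log probabilities on disallowed transitions.

```python
import numpy as np

def viterbi_score(obs, log_pi, log_A, log_B):
    """Log probability of the best state path through a discrete HMM.

    obs    : sequence of codebook indices (length T)
    log_pi : (N,)   log initial-state probabilities
    log_A  : (N, N) log transition matrix; -inf forbids a transition,
             which is how a left-to-right chain is represented
    log_B  : (N, K) log probabilities of the K codebook symbols per state
    """
    delta = log_pi + log_B[:, obs[0]]          # best-path score ending in each state
    for o in obs[1:]:
        # best predecessor for each state, then emit the current symbol
        delta = np.max(delta[:, None] + log_A, axis=0) + log_B[:, o]
    return delta.max()

# Maximum likelihood classification over the ten digit models:
# digit = max(models, key=lambda m: viterbi_score(obs, m.log_pi, m.log_A, m.log_B))
```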