Hidden Markov models with templates as non-stationary states: an application to speech recognition

Abstract

In most implementations of hidden Markov models (HMMs), a state is assumed to generate a stationary random sequence of observation vectors whose mean and covariance are estimated. Successive observations in a state are assumed to be independent and identically distributed. These assumptions are reasonable when each state represents a short segment of the speech signal; when states represent longer portions of the signal (e.g. phonemes, diphones, etc.), both assumptions are inaccurate. Recently, some attempts have been made to incorporate correlations between successive observations in a state, but to our knowledge non-stationarity has not been addressed. We propose an alternative representation in which a state of an HMM is defined as a template, i.e. a "typical" sequence of observations. The template for a state is derived from an ensemble of segments corresponding to that state. In our present implementation, the observations are 11th-order cepstrum vectors plus energy, states represent diphones, and ensembles of the diphones are obtained from a hand-labeled, speaker-dependent database of 2000 fluently spoken sentences. The probability of a test sequence being generated in a given state is obtained by time-warping the test utterance to the template and assuming that the differences between corresponding observations have a joint distribution. Tests on 50 sentences (outside the training set) indicate a correct phoneme recognition rate of about 70%.
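The scoring step described in the abstract, warping a test segment onto a state template and evaluating the aligned differences, can be sketched as follows. This is a minimal illustration rather than the paper's implementation: it uses a standard symmetric dynamic-time-warping recursion, and for simplicity it scores the per-frame differences under an independent, zero-mean, diagonal-covariance Gaussian instead of a full joint distribution. The function names (`dtw_align`, `state_log_likelihood`) and the variance parameter `var` are hypothetical.

```python
import numpy as np

def dtw_align(test, template):
    """Dynamic time warping of a test sequence (T x d) onto a state
    template (N x d). Returns the minimum-cost warping path as a
    list of (test_index, template_index) pairs."""
    T, N = len(test), len(template)
    # Frame-to-frame Euclidean distances, shape (T, N).
    dist = np.linalg.norm(test[:, None, :] - template[None, :, :], axis=2)
    cost = np.full((T, N), np.inf)
    cost[0, 0] = dist[0, 0]
    for i in range(T):
        for j in range(N):
            if i == 0 and j == 0:
                continue
            prev = min(
                cost[i - 1, j] if i > 0 else np.inf,                # advance test only
                cost[i, j - 1] if j > 0 else np.inf,                # advance template only
                cost[i - 1, j - 1] if i > 0 and j > 0 else np.inf,  # advance both
            )
            cost[i, j] = dist[i, j] + prev
    # Backtrack from the end of both sequences to recover the path.
    i, j = T - 1, N - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        moves = []
        if i > 0 and j > 0:
            moves.append((cost[i - 1, j - 1], (i - 1, j - 1)))
        if i > 0:
            moves.append((cost[i - 1, j], (i - 1, j)))
        if j > 0:
            moves.append((cost[i, j - 1], (i, j - 1)))
        i, j = min(moves)[1]
        path.append((i, j))
    path.reverse()
    return path

def state_log_likelihood(test, template, var):
    """Log-likelihood of a test segment under one template state:
    align with DTW, then score each aligned difference under a
    zero-mean Gaussian with diagonal variances `var` (length d).
    A simplification of the joint distribution used in the paper."""
    ll = 0.0
    for i, j in dtw_align(test, template):
        diff = test[i] - template[j]
        ll += -0.5 * np.sum(diff ** 2 / var + np.log(2.0 * np.pi * var))
    return ll
```

In recognition, a log-likelihood of this form would be computed per candidate state and combined with the HMM transition scores; a test segment identical to the template aligns along the diagonal and incurs only the Gaussian normalization terms.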