A comparison of state-duration modeling techniques for connected speech recognition

State duration in hidden Markov model (HMM) speech recognition systems has traditionally been modeled with self-transitions, forcing state durations to be geometrically distributed. There is strong evidence that the geometric distribution is not the most appropriate duration model and, for this reason, other HMM representations with more general duration models have been proposed. In this thesis, a hidden semi-Markov model (HSMM) is used to explicitly model arbitrary state-duration distributions. The expectation-maximization (EM) algorithm is used to determine the state-duration probabilities for a nonparametric distribution. The nonparametric terms are shown to be sufficient statistics for the reestimation of any duration distribution parameter from the regular exponential family. In addition, the EM algorithm reestimation equations are derived for mixture durations where the individual mixture distributions are from the regular exponential family. Experimental results compare the geometric, Poisson, shifted Poisson, gamma, Gaussian, mixture and nonparametric duration models and indicate that the shifted Poisson distribution provides the highest recognition accuracy on a talker-independent, connected-alphadigit recognition task. Analysis shows that appropriate modeling of second-order duration statistics increases recognition performance significantly. Two alternative HMM formulations are proposed to take advantage of the improvements offered by modeling higher-order duration statistics. The constrained variance HMM (CV-HMM) incorporates knowledge of true word-duration statistics into the explicit duration model and yields the best recognition results obtained on the task to date. Finally, the linear variance HMM (LV-HMM) addresses the computational complexity issue in the implementation of an HSMM. The LV-HMM approximates the Poisson model for state duration with an appropriately specified traditional HMM. Results show improved performance over the geometric distribution with no additional parameters and without the computational complexity of a fully explicit duration model.
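
The contrast between the geometric duration model implied by self-transitions and a shifted Poisson model can be illustrated with a minimal Python sketch. It is not taken from the thesis: the function names, the shift of one frame, the mean duration of five frames and the truncation at 200 frames are all assumptions chosen for illustration, and the thesis may parameterize the shifted Poisson differently. Both models are given the same mean duration, so the difference shows up in the variance (a second-order statistic) and in the location of the mode.

```python
import math

def geometric_pmf(d, self_loop):
    """P(duration = d) for a state with self-transition probability self_loop."""
    if d < 1:
        return 0.0
    return self_loop ** (d - 1) * (1.0 - self_loop)

def shifted_poisson_pmf(d, rate, shift=1):
    """Poisson pmf moved right by `shift` frames (the shift of 1 is an assumption)."""
    k = d - shift
    if k < 0:
        return 0.0
    # Evaluate in the log domain so large k does not overflow.
    return math.exp(-rate + k * math.log(rate) - math.lgamma(k + 1))

def moments(pmf, max_d=200):
    """Mean and variance of a duration pmf, truncated at max_d frames."""
    durations = range(1, max_d + 1)
    mean = sum(d * pmf(d) for d in durations)
    var = sum((d - mean) ** 2 * pmf(d) for d in durations)
    return mean, var

def mode(pmf, max_d=200):
    """Most probable duration (first one found if there is a tie)."""
    return max(range(1, max_d + 1), key=pmf)

# Hypothetical target: a mean state duration of 5 frames for both models.
target_mean = 5.0
self_loop = 1.0 - 1.0 / target_mean   # geometric mean is 1 / (1 - a)
rate = target_mean - 1.0              # shifted Poisson mean is rate + shift

geo = lambda d: geometric_pmf(d, self_loop)
spo = lambda d: shifted_poisson_pmf(d, rate)

geo_mean, geo_var = moments(geo)
spo_mean, spo_var = moments(spo)

print("geometric:       mean", geo_mean, "variance", geo_var, "mode", mode(geo))
print("shifted Poisson: mean", spo_mean, "variance", spo_var, "mode", mode(spo))
```

With a self-transition probability a, the geometric model has mean 1/(1 - a) and variance a/(1 - a)^2, so fixing the mean also fixes the variance (20 for a mean of 5), and the most probable duration is always a single frame. The shifted Poisson with the same mean has variance 4 under these assumptions, and its mode lies near the mean.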