A DTW-based dissimilarity measure for left-to-right hidden Markov models and its application to word confusability analysis

We propose a dynamic time warping (DTW) based measure of the dissimilarity between pairs of left-to-right continuous-density hidden Markov models (HMMs) whose state observation densities are Gaussian mixtures. The local distortion score required by DTW is defined as an approximate Kullback-Leibler divergence (KLD) between two Gaussian mixture models (GMMs). Several approximate KLDs are studied and compared on pairs of GMMs with different properties, and one of them is selected for use in our DTW-based HMM dissimilarity measure. In an experiment on automatically identifying subsets of confusable Putonghua (Mandarin Chinese) syllables, the result based on the proposed HMM dissimilarity measure is highly consistent with the one derived from the syllable recognition confusion matrix obtained on a test data set.
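
The general idea can be illustrated with a minimal sketch, which is not the paper's exact formulation. It rests on several assumptions: each HMM is reduced to its left-to-right list of state GMMs with diagonal covariances, transition probabilities are ignored, the local distortion is a symmetrized matching-based KLD approximation (the abstract does not say which approximation was finally selected), and the DTW uses the three standard steps with normalization by the sum of the state counts. All function names are hypothetical.

```python
import math

# A GMM is a list of (weight, means, variances) tuples with diagonal covariances.
# An HMM is represented only by its left-to-right list of state GMMs;
# transition probabilities are ignored in this sketch (an assumption).

def gauss_kld(mu_f, var_f, mu_g, var_g):
    """Closed-form KL(f || g) for two diagonal-covariance Gaussians."""
    return 0.5 * sum(
        math.log(vg / vf) + (vf + (mf - mg) ** 2) / vg - 1.0
        for mf, vf, mg, vg in zip(mu_f, var_f, mu_g, var_g)
    )

def gmm_kld_matched(f, g):
    """Matching-based approximation of KL(f || g): each component of f
    is paired with the component of g that minimizes the penalized KLD."""
    total = 0.0
    for w_f, mu_f, var_f in f:
        best = min(
            gauss_kld(mu_f, var_f, mu_g, var_g) + math.log(w_f / w_g)
            for w_g, mu_g, var_g in g
        )
        total += w_f * best
    return total

def local_distortion(f, g):
    """Symmetrized approximate KLD, used here as the DTW local cost."""
    return gmm_kld_matched(f, g) + gmm_kld_matched(g, f)

def dtw_hmm_dissimilarity(states_a, states_b):
    """Align two state sequences with DTW and return the accumulated
    local distortion, normalized by the total number of states."""
    n, m = len(states_a), len(states_b)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = local_distortion(states_a[i - 1], states_b[j - 1])
            # Allowed predecessors: vertical, horizontal, and diagonal steps.
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m] / (n + m)

# Two toy single-component states in one dimension.
s1 = [(1.0, [0.0], [1.0])]
s2 = [(1.0, [1.0], [1.0])]
same = dtw_hmm_dissimilarity([s1, s2], [s1, s2])
swapped = dtw_hmm_dissimilarity([s1, s2], [s2, s1])
```

In this toy case, two models with identical state sequences get a dissimilarity of zero, while swapping the state order yields a positive value. Pairs of syllable models with small values would be the candidates for a confusable subset.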
