Trajectory Clustering for Solving the Trajectory Folding Problem in Automatic Speech Recognition

In this paper, we introduce a novel method for clustering speech gestures, represented as continuous trajectories in acoustic parameter space. Trajectory clustering allows us to avoid the conditional independence assumption, which makes it difficult to account for the fact that successive measurements of an articulatory gesture are correlated. We apply the trajectory clustering method to develop multiple parallel hidden Markov models (HMMs) for a continuous digit recognition task. We compare the recognition performance obtained with data-driven clustering to that obtained with conventional head-body-tail models, which use knowledge-based criteria for building multiple HMMs in order to obviate the trajectory folding problem. The results show that trajectory clustering is able to discover structure in the training database that is different from the structure assumed by the knowledge-based approach. In addition, the data-derived structure gives rise to significantly better recognition performance, and results in a 10% reduction in word error rate.
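To make the idea of clustering whole trajectories concrete, the following is a minimal sketch and not the algorithm used in the paper. It assumes each speech token is a variable-length array of acoustic feature vectors, summarizes each token by the coefficients of a low-order polynomial fitted over normalized time, and groups the coefficient vectors with plain k-means. The function names (`trajectory_features`, `kmeans`, `cluster_trajectories`), the polynomial order, and the use of k-means in place of a probabilistic mixture model are all illustrative assumptions.

```python
import numpy as np


def trajectory_features(traj, order=2):
    """Summarize one trajectory of shape (T, D) by polynomial coefficients.

    Time is normalized to [0, 1] so that tokens of different duration
    become comparable. Assumes T > order.
    """
    traj = np.asarray(traj, dtype=float)
    t = np.linspace(0.0, 1.0, len(traj))
    # With a 2-D y, np.polyfit fits each column independently and
    # returns an array of shape (order + 1, D).
    coeffs = np.polyfit(t, traj, order)
    return coeffs.ravel()


def kmeans(X, k, n_iter=50, seed=0):
    """Plain k-means on the rows of X; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centers; an empty cluster keeps its previous center.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


def cluster_trajectories(trajs, k, order=2):
    """Cluster variable-length trajectories by their overall shape."""
    X = np.stack([trajectory_features(t, order) for t in trajs])
    return kmeans(X, k)
```

Because each token is reduced to a single shape descriptor before clustering, the grouping depends on the trajectory as a whole rather than on individual frames treated independently. In a setup like the one described above, the tokens in each resulting cluster could then be used to train a separate parallel model.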
