Speech trajectory clustering for improved speech recognition

Context-dependent modelling is known to improve recognition performance in automatic speech recognition. One of the major limitations, especially of approaches based on decision trees, is that the questions that guide the search for effective contexts must be known in advance. However, the variation in speech signals is caused by multiple factors, not all of which may be known during training. State tying methods, on the other hand, are strictly local, and therefore do not allow us to exploit variation that spans longer units such as syllables. In this paper, we present an approach that requires no prior knowledge and can still find the most important variants of speech units of arbitrary length. The method is based on clustering the multi-dimensional dynamic trajectories corresponding to speech units. We thus define multipath model topologies based on automatically derived clusters of dynamic trajectories (Trajectory Clustering based hidden Markov models, TCHMMs). We compare the clusters obtained with Trajectory Clustering against knowledge-based context-dependent Head and Tail models in a Head-Body-Tail (HBT) connected-digits recognition task. Our results show that TCHMMs significantly outperform conventional HBT models.
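As an illustrative sketch only (not the paper's actual algorithm), the core idea of clustering variable-length, multi-dimensional dynamic trajectories can be approximated by resampling each trajectory to a common number of frames and running k-means on the flattened results; the function names, the farthest-point initialisation, and all parameters below are hypothetical choices for the sketch.

```python
import numpy as np

def resample_trajectory(traj, n_frames):
    """Linearly resample a (T, D) feature trajectory to (n_frames, D)."""
    t_old = np.linspace(0.0, 1.0, len(traj))
    t_new = np.linspace(0.0, 1.0, n_frames)
    return np.stack([np.interp(t_new, t_old, traj[:, d])
                     for d in range(traj.shape[1])], axis=1)

def cluster_trajectories(trajs, n_clusters=2, n_frames=10, n_iter=20):
    """Toy trajectory clustering: resample each variable-length trajectory
    to a common length, flatten, and run k-means with a deterministic
    farthest-point initialisation."""
    X = np.stack([resample_trajectory(t, n_frames).ravel() for t in trajs])
    # Deterministic init: start from X[0], then repeatedly add the point
    # farthest from the already-chosen centers.
    centers = [X[0]]
    for _ in range(n_clusters - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.stack(centers)
    # Standard Lloyd iterations: assign each trajectory to the nearest
    # center, then recompute centers as cluster means.
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels
```

For example, a set of rising and falling one-dimensional trajectories of different lengths would be separated into two clusters, each of which could then serve as the basis for one path of a multipath model topology.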