State clustering in hidden Markov model-based continuous speech recognition

Abstract. A key problem in the use of context-dependent hidden Markov models is the need to balance the desired model complexity against the amount of available training data. This paper describes a method which uses a simple agglomerative algorithm to cluster and tie acoustically similar states. The main properties of the algorithm are explored using phone recognition on the TIMIT database, where it is shown that there is an optimum between the clustering extrema of an untied context-dependent system and a fully tied monophone system. At this optimum, phone recognition performance was 76·7% correct and 72·3% accuracy. The use of state-tying in the HTK continuous speech recognition system is then described and results are presented using the Resource Management database. The average error rate across the Feb '89, Oct '89 and Feb '91 test sets was less than 4·3%, and this was achieved without cross-word triphones. Gender-dependent models were also compared with gender-independent models but were found to give little improvement.
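To make the idea of agglomerative state clustering concrete, the following is a minimal illustrative sketch and not the implementation evaluated in this paper. It assumes that each state is a single diagonal-covariance Gaussian given as a (means, variances) pair, that the distance between two states is a variance-normalised Euclidean distance between their means, that the distance between two clusters is the largest distance between any pair of their members, and that merging stops when the closest pair of clusters is further apart than a threshold. The names `state_distance`, `cluster_distance`, `agglomerative_tie` and `threshold` are hypothetical and chosen only for this example.

```python
import math


def state_distance(a, b):
    """Variance-normalised Euclidean distance between two states.

    Each state is a (means, variances) pair of equal-length sequences.
    This metric is an assumption made for the sketch.
    """
    means_a, vars_a = a
    means_b, vars_b = b
    total = sum(
        (ma - mb) ** 2 / math.sqrt(va * vb)
        for ma, mb, va, vb in zip(means_a, means_b, vars_a, vars_b)
    )
    return math.sqrt(total / len(means_a))


def cluster_distance(c1, c2, states):
    """Furthest-neighbour distance: the largest distance between any two members."""
    return max(state_distance(states[i], states[j]) for i in c1 for j in c2)


def agglomerative_tie(states, threshold):
    """Merge the closest pair of clusters until none is within `threshold`.

    Returns a list of clusters, each a list of state indices that would
    share (tie) one set of output distribution parameters.
    """
    clusters = [[i] for i in range(len(states))]
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(clusters[x], clusters[y], states)
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        if d > threshold:
            break  # the closest remaining pair is too dissimilar to tie
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Toy one-dimensional states: two acoustically close, one distant.
toy_states = [([0.0], [1.0]), ([0.1], [1.0]), ([5.0], [1.0])]
tied = agglomerative_tie(toy_states, threshold=0.5)
```

In this sketch the threshold plays the role of the control between the two extrema mentioned above: a threshold of zero leaves every state untied, while a sufficiently large threshold merges all candidate states into a single cluster. If clustering were restricted to corresponding states of the same base phone, the latter extreme would correspond to a monophone-like system.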