Utterance normalization using vowel features in a spoken word recognition system for multiple speakers

The authors propose a novel normalization method based on a linear transformation of the acoustic features of input speech, estimated from only a single isolated utterance of each of the five Japanese vowels by each speaker. Isolated-word recognition experiments combining the proposed normalization method with multiple-template DP matching showed a marked improvement in the recognition rate, especially for smaller numbers of templates per word. The proposed method gives consistently higher word recognition scores than a four-dimensional representation based on the Karhunen-Loeve transformation, and also gives higher scores than the original 16-dimensional representation of filter-bank outputs, especially when the number of templates is small. Together with the fact that the method reduces the dimension of the feature vector by a factor of four, these results demonstrate its validity.
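The normalization idea described above can be sketched as follows: a per-speaker linear transform is fitted so that the speaker's five vowel feature vectors land near reference vowel targets, and the same transform is then applied to all incoming frames. This is a minimal illustration with assumed details; the feature dimensions (16 filter-bank outputs mapped to a 4-dimensional normalized space, matching the factor-of-four reduction), the least-squares estimator, and all variable names are assumptions, not the paper's exact formulation.

```python
import numpy as np

# --- Hypothetical sketch of vowel-based speaker normalization ---
# Dimensions are assumptions: 16 filter-bank outputs per frame,
# normalized to a 4-dimensional space (a factor-of-four reduction).
DIM_IN, DIM_OUT, N_VOWELS = 16, 4, 5  # five Japanese vowels: a, i, u, e, o

rng = np.random.default_rng(0)

# Reference vowel targets in the normalized space (e.g. derived from
# training speakers); synthetic here for illustration.
ref_vowels = rng.normal(size=(N_VOWELS, DIM_OUT))

# The new speaker's 16-dim feature vectors for the same five vowels.
spk_vowels = rng.normal(size=(N_VOWELS, DIM_IN))

# Fit a linear transform A (16 x 4) by least squares so that
# spk_vowels @ A approximates ref_vowels.
A, *_ = np.linalg.lstsq(spk_vowels, ref_vowels, rcond=None)

def normalize(frames: np.ndarray) -> np.ndarray:
    """Map a speaker's frames (T x 16) into the 4-dim normalized space."""
    return frames @ A

# Applying the transform to the vowel frames recovers the targets
# (the system is underdetermined, so the fit is essentially exact here).
normalized = normalize(spk_vowels)
residual = np.linalg.norm(normalized - ref_vowels)
```

After this step, word recognition would proceed by DP matching of the normalized 4-dimensional frame sequences against word templates.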