Short utterance recognition using a network with minimum training

Abstract A feedforward network is used to recognize short, digitized, isolated utterances. A high, multispeaker recognition rate is achieved on a small vocabulary with only a single training utterance per word. This approach uses the pattern recognition property of the network architecture to classify different temporal patterns in the multidimensional feature space. The network recognizes the utterances without the need for segmentation, phoneme identification, or time alignment. We train the network with four words spoken by a single speaker. The network is then able to recognize 20 tokens spoken by 5 other speakers. We repeat this training and testing procedure, using a different speaker's utterances for training each time. The overall accuracy is 97.5%. We compare this approach to the traditional dynamic programming (DP) approach, and find that DP with slope constraints of 0 and 1 achieves accuracies of 98.5% and 85%, respectively. Finally, we validate our statistics by training and testing the network on a four-word subset of the Texas Instruments (TI) isolated-word database. The accuracy with this vocabulary exceeds 96%. Doubling the size of the training set raises the accuracy to 98%. Using a suitable threshold, we raise the accuracy of one network from 87% to 98.5%; applying thresholding to all networks would then raise the overall accuracy to well over 99%. This technique is especially promising because of its low overhead and computational requirements, which make it suitable for low-cost, portable command-recognition applications.
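The DP baseline referenced above is the classical dynamic-time-warping match between a test token and a stored reference template. A minimal sketch of the unconstrained case (slope constraint 0, where each step may repeat either time axis freely) might look like the following; the frame-level Euclidean distance used as the local cost, and the `classify` helper, are illustrative assumptions rather than the paper's exact formulation:

```python
def dtw_distance(a, b):
    """Dynamic-programming alignment cost between two utterances,
    each a sequence of feature vectors (lists of floats).
    Slope constraint 0: steps may go right, down, or diagonal."""
    INF = float("inf")
    n, m = len(a), len(b)
    # D[i][j] = minimal cumulative cost aligning a[:i] with b[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local cost: Euclidean distance between feature frames (assumed)
            cost = sum((x - y) ** 2 for x, y in zip(a[i - 1], b[j - 1])) ** 0.5
            D[i][j] = cost + min(D[i - 1][j],      # repeat a frame of b
                                 D[i][j - 1],      # repeat a frame of a
                                 D[i - 1][j - 1])  # advance both
    return D[n][m]

def classify(token, templates):
    """Label a test token with the vocabulary word whose reference
    template has the lowest warping cost (hypothetical helper)."""
    return min(templates, key=lambda w: dtw_distance(token, templates[w]))
```

A slope constraint of 1 would additionally forbid consecutive horizontal or vertical steps, restricting how far the warp can stretch or compress the token; the abstract's results suggest that looser warping (constraint 0) helps for this task.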