Abstract

A feedforward network is used to recognize short, digitized, isolated utterances. A high multispeaker recognition rate is achieved on a small vocabulary using a single training utterance per word. The approach uses the pattern recognition property of the network architecture to classify different temporal patterns in the multidimensional feature space. The network recognizes the utterances without segmentation, phoneme identification, or time alignment. We train the network with four words spoken by a single speaker. The network is then able to recognize 20 tokens spoken by 5 other speakers. We repeat this training and testing procedure, using a different speaker's utterances for training each time. The overall accuracy is 97.5%. We compare this approach to the traditional dynamic programming (DP) approach and find that DP with slope constraints of 0 and 1 achieves accuracies of 98.5% and 85%, respectively. Finally, we validate our statistics by training and testing the network on a four-word subset of the Texas Instruments (TI) isolated word database. The accuracy with this vocabulary exceeds 96%. Doubling the size of the training set raises the accuracy to 98%. Using a suitable threshold, we are able to raise the accuracy of one network from 87% to 98.5%. Thresholding applied to all networks would then raise the overall accuracy to well over 99%. This technique is especially promising because of its low overhead and computational requirements, which make it suitable for low-cost, portable command recognition applications.
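For readers unfamiliar with the DP baseline mentioned above, the following is a minimal sketch of template matching by dynamic time warping. It is an illustration under stated assumptions, not the formulation evaluated in the paper: it uses no slope constraint, a Euclidean local distance, and one stored template per word. The names `dtw_distance`, `classify`, and `templates` are hypothetical.

```python
def dtw_distance(a, b):
    """Unconstrained DTW cost between two sequences of feature vectors.

    Assumption: Euclidean local distance and the basic three-way
    recursion (insert, delete, match); no slope constraint is applied.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local distance between frame i-1 of a and frame j-1 of b
            d = sum((x - y) ** 2 for x, y in zip(a[i - 1], b[j - 1])) ** 0.5
            cost[i][j] = d + min(cost[i - 1][j],      # stretch b
                                 cost[i][j - 1],      # stretch a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def classify(utterance, templates):
    """Return the word whose single stored template is closest under DTW."""
    return min(templates, key=lambda w: dtw_distance(utterance, templates[w]))


# Hypothetical one-dimensional "feature" sequences, for illustration only.
templates = {"yes": [[0.0], [1.0], [2.0]], "no": [[5.0], [5.0], [5.0]]}
print(classify([[0.0], [0.0], [1.0], [2.0]], templates))
```

The quadratic table makes the cost of this baseline grow with the product of the two utterance lengths for every template compared, which is the kind of time-alignment work the network approach avoids at recognition time.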