Speech recognition with character string encoding

This paper describes an isolated word recognition system that uses character string encoding and has achieved 98% correct recognition scores on limited vocabularies (20-54 words). Speaker normalization, word segmentation, and learning paradigms have been incorporated. Audio input passes through a 6-channel octave band-pass filter bank. The output of each channel is time-integrated over 10 ms and log-mapped. An utterance is represented by a succession of points (a new point is generated every 10 ms) in the 6-dimensional space defined by the 6 octave bands. Reference points are scattered throughout the space, and each time interval is assigned the label of the nearest reference point. We call the resulting string of labels a "character string". An utterance may be encoded into a character string with an arbitrary degree of precision: using more reference points gives greater resolution. Only 24 reference points are needed to achieve 98% correct recognition scores for 54 words in near real time. String generation techniques are explored. Several learning schemes based on character strings are described. Finally, experiments with a software classifier that uses "deformable templates" based on character strings are presented.
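
The encoding step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not specify the distance measure or the label alphabet, so Euclidean distance and the letters A, B, C, ... are assumptions here, as are the function name `encode_utterance` and the toy frame and reference-point values.

```python
import math
import string


def encode_utterance(frames, reference_points, labels=string.ascii_uppercase):
    """Return one label per frame: the label of the nearest reference point.

    frames           -- sequence of 6-dimensional points, one per 10 ms interval
    reference_points -- sequence of 6-dimensional points scattered in the space
    labels           -- one character per reference point (assumed alphabet)
    """
    if len(reference_points) > len(labels):
        raise ValueError("not enough labels for the reference points")
    chars = []
    for frame in frames:
        # Euclidean distance is an assumption; the abstract only says "nearest".
        nearest = min(
            range(len(reference_points)),
            key=lambda i: math.dist(frame, reference_points[i]),
        )
        chars.append(labels[nearest])
    return "".join(chars)


# Toy data (hypothetical values standing in for log-mapped filter-bank outputs).
reference_points = [
    (0.0, 0.0, 0.0, 0.0, 0.0, 0.0),  # labeled "A"
    (1.0, 1.0, 1.0, 1.0, 1.0, 1.0),  # labeled "B"
    (2.0, 2.0, 2.0, 2.0, 2.0, 2.0),  # labeled "C"
]
frames = [
    (0.1, 0.0, 0.0, 0.1, 0.0, 0.0),
    (0.9, 1.0, 1.1, 1.0, 1.0, 0.9),
    (1.0, 1.0, 1.0, 1.0, 1.0, 1.2),
    (2.1, 1.9, 2.0, 2.0, 2.2, 2.0),
]

print(encode_utterance(frames, reference_points))  # prints ABBC
```

With 24 reference points, as in the reported 54-word experiment, the same routine would emit strings over a 24-character alphabet, one character per 10 ms interval.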