Reevaluation of the Significance of Sequence Information for Speech Recognition

A central difficulty in automatic speech recognition is the temporally inaccurate nature of the speech signal. Despite this, speech has traditionally been modeled as a purely sequential (albeit probabilistic) process. This paper re-evaluates the usefulness of accurate sequence information for speech recognition at both the acoustic and lexical levels. At the acoustic level, speech segments are quantized into discrete vectors and converted into set representations rather than accurate sequences. Recognition using the quantized vector sets performed dramatically better than recognition using the corresponding vector sequence representations. At the lexical level, our study suggests that accurate sequence information is, again, not crucial. In fact, locally discarding phoneme sequence information may help in coping with errors such as insertions and substitutions. A lexical access algorithm is developed based on the idea of phone set indexing. This work thus questions the traditional approach of modeling speech as a purely sequential process and suggests that discarding local sequential information may be beneficial. A set representation appears to be a viable alternative to a purely sequential one.
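
To make the idea of phone set indexing concrete, the following is a minimal sketch and not the algorithm developed in this work. It assumes a toy lexicon, treats each word as an unordered set of phones, and ranks candidates by Jaccard overlap; the names LEXICON, build_index and lookup, the phone labels, and the scoring rule are all hypothetical choices made for illustration.

```python
from collections import defaultdict

# Hypothetical toy lexicon: word -> phone sequence (illustrative only).
LEXICON = {
    "cat": ["k", "ae", "t"],
    "cap": ["k", "ae", "p"],
    "dog": ["d", "ao", "g"],
    "sit": ["s", "ih", "t"],
}


def build_index(lexicon):
    """Build an inverted index: phone -> set of words containing that phone."""
    index = defaultdict(set)
    for word, phones in lexicon.items():
        for phone in set(phones):  # order within the word is discarded here
            index[phone].add(word)
    return index


def lookup(observed, lexicon, index):
    """Rank candidate words by Jaccard overlap between phone sets.

    Assumption: set overlap is used as the score; the source does not
    specify a scoring rule.
    """
    obs = set(observed)
    # Only words sharing at least one phone with the input are candidates.
    candidates = set()
    for phone in obs:
        candidates |= index.get(phone, set())
    scored = []
    for word in candidates:
        ref = set(lexicon[word])
        scored.append((len(obs & ref) / len(obs | ref), word))
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scored, key=lambda s: (-s[0], s[1]))


INDEX = build_index(LEXICON)
```

In this sketch, calling lookup(["t", "ae", "k"], LEXICON, INDEX) ranks "cat" first even though the phones arrive in the wrong order, and an input with an inserted or substituted phone still ranks "cat" first with a lower score. Using whole-word sets keeps the example short; a windowed variant that discards order only locally would be closer to the wording of the abstract.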