Paper information - Reevaluation of the Significance of Sequence Information for Speech Recognition

A central difficulty in automatic speech recognition is the temporally inaccurate nature of the speech signal. Despite this, speech has traditionally been modeled as a purely sequential (albeit probabilistic) process. This paper re-evaluates the usefulness of accurate sequence information for speech recognition at both the acoustic and lexical levels. At the acoustic level, speech segments are quantized into discrete vectors and converted into set representations rather than accurate sequences. Recognizing the quantized vector sets, rather than the corresponding vector sequences, dramatically improved performance. At the lexical level, our study suggests that accurate sequence information is, again, not crucial. In fact, locally discarding phoneme sequence information may be useful for coping with errors (such as insertions and substitutions). Based on the idea of phone set indexing, a lexical access algorithm is developed. Thus, this work questions the traditional approach of modeling speech as a purely sequential process and suggests that discarding local sequence information may be a good idea. A set representation seems to be a viable alternative to a purely sequential one.
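
The abstract does not spell out the lexical access algorithm, so the following is only a minimal sketch of the general idea of indexing words by unordered phone sets, not the authors' method. The window size, the voting scheme, the toy lexicon, and all function names (`phone_sets`, `build_index`, `lookup`) are assumptions made for illustration. Each word is indexed by the unordered set of phones in every short window, so a local ordering error in the recognized phone string still retrieves the intended word.

```python
from collections import defaultdict


def phone_sets(phones, window=2):
    """Return the unordered phone set of each sliding window (hypothetical helper)."""
    if len(phones) <= window:
        return [frozenset(phones)]
    return [frozenset(phones[i:i + window])
            for i in range(len(phones) - window + 1)]


def build_index(lexicon, window=2):
    """Map each local phone set to the words whose pronunciation contains it."""
    index = defaultdict(set)
    for word, phones in lexicon.items():
        for key in phone_sets(phones, window):
            index[key].add(word)
    return index


def lookup(index, phones, window=2):
    """Rank candidate words by how many local phone sets they share with the input."""
    votes = defaultdict(int)
    for key in phone_sets(phones, window):
        for word in index.get(key, ()):
            votes[word] += 1
    # Highest vote count first; ties broken alphabetically.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy lexicon with illustrative phone symbols (an assumption, not from the paper).
LEXICON = {
    "speech": ["s", "p", "iy", "ch"],
    "speed": ["s", "p", "iy", "d"],
    "peach": ["p", "iy", "ch"],
    "dog": ["d", "ao", "g"],
}
INDEX = build_index(LEXICON)

# Recognizer output with the first two phones swapped: no exact sequence match exists.
RESULT = lookup(INDEX, ["p", "s", "iy", "ch"])
print(RESULT)
```

In this toy setting, "speech" shares two local phone sets with the swapped input and ranks first, while an exact sequence comparison would find no match. The cost is that words whose phones differ only in local order become harder to tell apart.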

Dana H. Ballard | Ramesh R. Sarukkai
