Exploring Temporal Domain for Robustness in Speech Recognition

I. Abstract

The paper reviews several techniques which are used in conjunction with short-term analysis and which are reported to be more robust in the presence of noise or other non-linguistic factors. We show that one property common to all such techniques is that they effectively extract speech features from segments of speech longer than 10-20 ms. The communication channel and its noise level most often remain fixed, or vary only rather slowly, during a conversation. On the other hand, steady configurations of the vocal tract are rare and carry only little linguistic information.

The description of the speech signal as a succession of equally spaced short-term samples originated in speech coding. It assumes that short-term (about 10-20 ms) segments of speech are independent samples from different and unrelated stationary processes. A fundamental linguistic unit is likely to be longer than 10 ms, and one frame of short-term analysis describes only a relatively short (quasi-stationary) part of it. Since only a short-term "snapshot" of the signal is available at any given time, it is hard to distinguish between "short-term quasi-stationary" signals (such as speech) and "long-term quasi-stationary" disturbances (such as the fixed frequency characteristics of the communication channel or noise).

It appears that the short-term memory of the auditory periphery in mammals (exhibited, e.g., by forward masking (see e.g. [?]), the firing-rate adaptation constant (see e.g. [?]), and the buildup of loudness (see e.g. [?])) is at least of the order of about 200 ms, i.e. an order of magnitude longer than the temporal window of short-term analysis. That means the peripheral human auditory system can effectively integrate rather large (about syllable-sized) time-spans of the audio signal.

III. Beyond 20 ms

Many speech researchers (see e.g. [?]) do not seem to be aware that some of the techniques already in use in feature extraction for ASR consider rather large time-spans of the speech signal. We briefly describe below several techniques (some of them rather well established) for post-processing of short-term speech feature vectors, all of which claim increased robustness in the presence of non-linguistic factors in speech, and argue that this increased robustness results from the fact that they use global knowledge well beyond 20 ms.

A. Dynamic (Delta) Features

Furui [?] introduced dynamic features of speech to describe the time trajectories of speech parameters in the vicinity of a given speech vector. He proposed the first three coefficients …
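The first-order regression (delta) computation described above can be sketched as follows. This is a minimal illustration rather than Furui's exact formulation: the function name `delta_features`, the symmetric ±M frame window, and the edge-padding choice are our assumptions.

```python
import numpy as np

def delta_features(frames, M=2):
    """First-order regression (delta) coefficients over a +/-M frame window.

    frames: (T, D) array of short-term feature vectors (e.g. cepstra).
    Returns a (T, D) array of deltas: for each frame, the slope of a
    least-squares line fit to each coefficient's local time trajectory.
    """
    T = len(frames)
    # Pad by repeating the edge frames so every frame has a full window.
    padded = np.concatenate([np.repeat(frames[:1], M, axis=0),
                             frames,
                             np.repeat(frames[-1:], M, axis=0)])
    taps = np.arange(-M, M + 1)          # regression time offsets -M..M
    denom = np.sum(taps ** 2)            # normalizer of the slope estimate
    deltas = np.zeros_like(frames, dtype=float)
    for t in range(T):
        window = padded[t:t + 2 * M + 1]         # frames t-M .. t+M
        deltas[t] = taps @ window / denom        # regression slope per dim
    return deltas
```

Note that with M = 2 and a typical 10 ms frame shift, the regression window already spans about 50 ms of signal, i.e. well beyond a single short-term frame.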

[1] J. C. Stevens, et al. Brightness and loudness as functions of stimulus duration, 1966.

[2] W. R. Webster, et al. Click-evoked response patterns of single units in the medial geniculate body of the cat. Journal of Neurophysiology, 1966.

[3] S. Furui, et al. Cepstral analysis technique for automatic speaker verification, 1981.

[4] T. M. Cannon, et al. Blind deconvolution through digital signal processing. Proceedings of the IEEE, 1975.

[5] S. Boll, et al. Suppression of acoustic noise in speech using spectral subtraction, 1979.

[6] Shozo Makino, et al. Recognition of consonant based on the perceptron model. ICASSP, 1983.

[7] Geoffrey E. Hinton, et al. Phoneme recognition using time-delay neural networks. IEEE Trans. Acoust. Speech Signal Process., 1989.

[8] Hynek Hermansky, et al. Continuous speech recognition using PLP analysis with multilayer perceptrons. ICASSP, 1991.

[9] Ronald A. Cole, et al. Speaker-independent phonetic classification in continuous English letters. IJCNN-91-Seattle International Joint Conference on Neural Networks, 1991.

[10] Brian Hanson, et al. Regression features for recognition of speech in quiet and in noise. ICASSP, 1991.

[11] Hideki Kawahara, et al. A dynamic cepstrum incorporating time-frequency masking and its application to continuous speech recognition. ICASSP, 1993.

[12] Dieter Geller, et al. Improvements in connected digit recognition using linear discriminant analysis and mixture densities. ICASSP, 1993.

[13] H. Hermansky, et al. Temporal masking in automatic speech recognition, 1994.

[14] Aaron E. Rosenberg, et al. Cepstral channel normalization techniques for HMM-based speaker verification. ICSLP, 1994.

[15] Michiel Bacchiani, et al. Optimization of time-frequency masking filters using the minimum classification error criterion. ICASSP, 1994.

[16] Richard M. Schwartz, et al. Adaptation to new microphones using tied-mixture normalization. ICASSP, 1994.