On the spectrographic representation of rapidly time-varying speech

Abstract: The spectrogram is the standard display for visual interpretation of the speech signal and, in one form or another, spectrographic data are very often used as the feature space for speech recognition. It is generally accepted that conventional sampling rates and window sizes may be inappropriate for good spectrography of time-varying speech. We conjecture that rapid time variation is much more insidious than that: a spectrogram can indicate formant tracks that totally belie the vocal-tract generation process. It is shown that the increased bandwidth due to rapid time variation can mask the expected instantaneous spectral representation, so current spectral analyses are very likely to provide inconsistent information for accurate classification of rapidly time-varying events such as stops.
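The bandwidth-broadening effect described in the abstract can be illustrated with a toy signal. The following is a minimal sketch only, not the analysis used in the paper: it assumes a single swept sinusoid as a stand-in for a moving formant, and the sampling rate, frame length, sweep rate, and the helper name `rms_bandwidth` are all hypothetical choices made for illustration. It compares the spectral spread of one Hann-windowed frame of a steady tone with that of a tone gliding quickly across the same frame.

```python
import numpy as np

def rms_bandwidth(frame, fs, nfft=4096):
    """RMS spread (Hz) of the power spectrum of one Hann-windowed frame.

    Hypothetical helper for illustration; not taken from the paper.
    """
    windowed = frame * np.hanning(len(frame))
    power = np.abs(np.fft.rfft(windowed, n=nfft)) ** 2
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    centroid = np.sum(freqs * power) / np.sum(power)
    return np.sqrt(np.sum((freqs - centroid) ** 2 * power) / np.sum(power))

# Assumed analysis settings: 10 kHz sampling, one 20 ms frame.
fs = 10000.0
n = 200
t = np.arange(n) / fs

# Steady 1500 Hz tone: its spread is set by the analysis window alone.
steady = np.cos(2 * np.pi * 1500.0 * t)

# Tone gliding linearly from 1000 Hz to 2000 Hz within the frame
# (assumed sweep rate of 50 kHz/s, a crude stand-in for a fast transition).
sweep_rate = 50000.0
glide = np.cos(2 * np.pi * (1000.0 * t + 0.5 * sweep_rate * t ** 2))

bw_steady = rms_bandwidth(steady, fs)
bw_glide = rms_bandwidth(glide, fs)
print(f"steady tone spread: {bw_steady:.1f} Hz")
print(f"gliding tone spread: {bw_glide:.1f} Hz")
```

Under these assumptions the gliding tone occupies a markedly wider band than the steady tone in the same frame, so a single short-time spectrum no longer shows a sharp peak at any one instantaneous frequency. Real speech transitions involve several interacting resonances, so this sketch shows only the simplest form of the smearing.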
