A comparative study of spectral peaks versus global spectral shape as invariant acoustic cues for vowels

The primary objective of this study was to compare two sets of vowel spectral features, formants and global spectral shape parameters, as invariant acoustic cues to vowel identity. Both automatic vowel recognition experiments and perceptual experiments were performed to evaluate the two feature sets. First, the feature sets were compared using static features, computed from a single spectrum sampled at the middle of each steady-state vowel, versus features computed from dynamic spectra. Second, the role of dynamic and contextual information was investigated in terms of improvements in automatic vowel classification rates. Third, several speaker normalization methods were examined for each feature set. Finally, perceptual experiments were performed to determine whether vowel perception correlates more strongly with formants or with global spectral shape. Results of the automatic vowel classification experiments indicate that global spectral shape features contain more information than formants. For both feature sets, dynamic features are superior to static features. Maximum vowel recognition accuracy requires spectral features spanning the interval from the start of the on-glide region to the end of the off-glide region of the acoustic vowel segment. Speaker normalization of both static and dynamic features further improves automatic vowel recognition accuracy. Results of the perceptual experiments with synthesized vowel segments indicate that, with formants held fixed, global spectral shape can, at least under some conditions, be modified so that the synthetic speech token is perceived according to spectral shape cues rather than formant cues. This result implies that overall spectral shape may be more important perceptually than the spectral prominences represented by the formants. The results of this research contribute to a fundamental understanding of how information is encoded in the speech signal. The signal processing techniques and acoustic features identified in this study can also be used to improve acoustic preprocessing in the front end of automatic speech recognition systems.
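
To make the contrast between the two feature sets concrete, the sketch below computes global spectral shape features for a single analysis frame as low-order cosine-transform coefficients of the log-magnitude spectrum. This is one common parameterization of overall spectral shape and is offered only as an illustration under stated assumptions, not as the exact analysis used in the study; the sample rate, frame length, coefficient count, and the function name spectral_shape_features are all illustrative choices. In the dynamic condition described above, features of this kind would be computed for several frames spanning the on-glide through the off-glide and concatenated into a single vector.

# Illustrative sketch (not the study's exact analysis): static global spectral
# shape features for one frame, taken as low-order DCT coefficients of the
# log-magnitude spectrum. All parameter values below are assumptions.
import numpy as np
from scipy.fftpack import dct

def spectral_shape_features(frame, n_coeffs=10):
    """Return the first n_coeffs DCT coefficients of the log-magnitude spectrum."""
    windowed = frame * np.hamming(len(frame))        # taper to reduce spectral leakage
    spectrum = np.abs(np.fft.rfft(windowed))         # magnitude spectrum
    log_spec = np.log(spectrum + 1e-10)              # log compression; epsilon avoids log(0)
    return dct(log_spec, type=2, norm='ortho')[:n_coeffs]  # keep coarse shape, drop fine detail

# Static features: one frame sampled at the vowel midpoint.
fs = 10000                                           # assumed sample rate (Hz)
t = np.arange(0, 0.025, 1 / fs)                      # 25 ms analysis frame
frame = np.sin(2 * np.pi * 500 * t) + 0.5 * np.sin(2 * np.pi * 1500 * t)  # toy two-resonance "vowel"
print(spectral_shape_features(frame))

Formant-based features, by contrast, would retain only the frequencies of the first few spectral peaks (obtained, for example, by peak-picking a linear prediction spectrum) and discard the rest of the spectral envelope.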
