The synergy between speech production and perception.

Speech intelligibility is known to be relatively unaffected by certain deformations of the acoustic spectrum. These include translations, dilations (stretching or contraction), and shearing of the spectrum, all defined along the logarithmic frequency axis. It is argued here that such robustness reflects a synergy between vocal production and auditory perception. On the one hand, it is shown that these spectral distortions are produced by common and unavoidable variations among speakers in the length, cross-sectional profile, and losses of their vocal tracts. On the other hand, it is argued that these spectral changes leave the auditory cortical representation of the spectrum largely unchanged, except for translations along one of its representational axes. These assertions are supported by analyses of production and perception models. On the production side, a simplified sinusoidal model of the vocal tract is developed which analytically relates a few "articulatory" parameters, such as the extent and location of the vocal tract constriction, to the spectral peaks of the acoustic spectra synthesized from it. The model is evaluated by comparing the identification of synthesized sustained vowels with that of labeled natural vowels extracted from the TIMIT corpus. On the perception side, a "multiscale" model of sound processing is used to elucidate the effects of the deformations on the representation of the acoustic spectrum in the primary auditory cortex. Finally, the implications of these results are discussed for the perception of generally identifiable classes of sound sources beyond the specific case of speech and the vocal tract.
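The link between vocal tract length and spectral translation can be illustrated with a minimal sketch. It does not use the sinusoidal model described above; it assumes the textbook idealization of a uniform, lossless tube closed at the glottis and open at the lips, whose resonances are F_k = (2k - 1)c / (4L). The function names, the two tract lengths, and the speed of sound are illustrative assumptions.

```python
import math

# Assumed value: approximate speed of sound in the vocal tract, in cm/s.
SPEED_OF_SOUND_CM_S = 35000.0


def tube_resonances(length_cm, n=3):
    """First n resonances (Hz) of a uniform lossless tube, closed at one
    end and open at the other (quarter-wavelength idealization)."""
    return [(2 * k - 1) * SPEED_OF_SOUND_CM_S / (4.0 * length_cm)
            for k in range(1, n + 1)]


def log_frequency_shift(f_ref, f_new):
    """Per-resonance displacement in octaves (log2 frequency ratio)."""
    return [math.log2(b / a) for a, b in zip(f_ref, f_new)]


# Two hypothetical speakers who differ only in vocal tract length.
long_tract = tube_resonances(17.5)
short_tract = tube_resonances(14.0)
shifts = log_frequency_shift(long_tract, short_tract)

for f_long, f_short, s in zip(long_tract, short_tract, shifts):
    print(f"{f_long:7.1f} Hz -> {f_short:7.1f} Hz  shift = {s:.4f} octaves")
```

Under these assumptions every resonance moves by the same amount, log2(17.5 / 14) ≈ 0.32 octave, so shortening the tube translates the whole resonance pattern along the logarithmic frequency axis without changing its shape. Dilation and shearing, which the abstract attributes to differences in cross-sectional profile and losses, are not captured by this uniform-tube sketch.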
