An acoustic-phonetic and articulatory study of speech-speaker dichotomy

The acoustic speech signal naturally contains information pertaining to both the linguistic message and the identity of the speaker. Separation of these two primary sources of variability has been of considerable concern to speech scientists for many years, yet we still lack a model of the speech communication process that can account for a wide variety of speakers, even for a restricted set of vowels of a given dialect of spoken English. Basic research was therefore undertaken to gain a better understanding of the separate influences of, and the interactions between, phonetic and speaker-specific attributes of vowel sounds. First, a new methodology for performing computer vowel recognition was developed, which has the ability to reveal the contrastive influence of vowel-speaker interactions across different regions of the frequency spectrum. The results obtained from applying this methodology have: (i) yielded significant insights into the long-standing problem of phonetic-speaker dichotomy; and (ii) prompted a search for an even more fundamental explanation in terms of the physical properties of the speech production mechanism. To this end, a new functional representation of vocal-tract shapes was derived, which depends directly on resonance parameters while retaining the uniqueness properties of the Linear-Prediction model of speech production. This hybrid modelling approach was used, together with a new articulatory method of speaker normalisation, to quantify speaker differences in vocal-tract shapes, and thus to define physical correlates of the phonetic-speaker interactions which were earlier shown to adversely affect vowel recognition accuracy in certain frequency bands. In sum, this research work embraces two major domains of computer speech science and technology, namely speech acoustics and speech production.
Beyond advancing knowledge in speaker characterisation, it ultimately has implications for the still-unresolved problem of speech or speaker recognition by computer.