INCORPORATING CONTEXTUAL PHONETICS INTO AUTOMATIC SPEECH RECOGNITION

This work outlines the problems encountered in modeling pronunciation for automatic speech recognition (ASR) of spontaneous (American) English speech. We detail some of the phonetic phenomena within the Switchboard corpus that make this speaking style difficult to recognize. Phonetic transcribers found that feature spreading and cue trading made it difficult to identify the boundaries between phonetic segments. Including different forms of context in pronunciation models, however, may alleviate these problems in the ASR domain. The syllable appears to play an important role: many of the observed phonetic phenomena are syllable-internal, and the increase in pronunciation variation relative to read speech is concentrated in coda consonants. In addition, we show that other forms of context – speaking rate and word predictability – help indicate where pronunciation variability increases. We present a dynamic ASR pronunciation model that uses longer phonetic context windows to capture the range of detail characteristic of naturally spoken language.