Prosodic knowledge sources for word hypothesization in a continuous speech recognition system

Previously we have reported on the extraction of prosodic cues (such as stress, pitch, and duration) from continuous speech [1] and on possible uses of some prosodic information (e.g., temporal cues [2]) in large vocabulary word recognition systems. In this paper we extend these previous findings to a speaker-independent continuous speech recognition system. Speaker-independent knowledge sources (KSs) were implemented that attempt to hypothesize words based only on prosodic cues found in the signal. The prosodic cues exploited were temporal cues (syllable durations, ratios of unvoiced segment durations to syllable durations, voiced segment durations), intensity profiles, and likelihoods of stressedness. Each KS extracts the appropriate prosodic cue and searches its knowledge base for words whose prosodic patterns satisfy the constraints found in the signal. Using a multispeaker continuous speech database for evaluation, each prosodic KS is shown to hypothesize the correct word substantially better than chance. All prosodic KSs were then combined and compared with a speaker-independent acoustic-phonetic word hypothesizer. After applying the prosodic KSs, the correct word ranked on average 25th (out of 252 words). The acoustic-phonetic KS alone yielded an average rank of 40 (out of 252) without the addition of prosodic information. After the prosodic and phonetic KSs were combined, the average rank was reduced to 15 out of 252. These results indicate that prosodic cues indeed add complementary information that substantially improves word hypothesization in speaker-independent continuous speech recognition systems.
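The matching and combination scheme described above can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: the lexicon entries, duration values, tolerance, and the average-rank fusion rule are all assumptions made for exposition. It shows one temporal KS (syllable durations) hypothesizing words whose stored pattern satisfies the constraints observed in the signal, and a simple way to combine rankings from several KSs.

```python
# Hypothetical sketch of a prosodic knowledge source (KS).
# Each KS extracts a prosodic cue from the signal and hypothesizes lexicon
# words whose stored prosodic pattern satisfies the observed constraint.
# Words, duration values (ms), and the tolerance are illustrative only.

LEXICON = {
    "recognition": [90, 60, 120, 70],  # per-syllable durations in ms
    "speech": [180],
    "system": [110, 80],
}

def duration_ks(observed_durations, tolerance_ms=30):
    """Hypothesize words whose syllable-duration pattern matches the signal.

    A word is hypothesized if it has the same number of syllables and its
    mean absolute duration deviation is within the tolerance; hypotheses
    are returned best-first (lowest deviation).
    """
    hypotheses = []
    for word, pattern in LEXICON.items():
        if len(pattern) != len(observed_durations):
            continue  # syllable-count constraint not satisfied
        score = sum(abs(o - p) for o, p in zip(observed_durations, pattern))
        score /= len(pattern)
        if score <= tolerance_ms:
            hypotheses.append((word, score))
    return sorted(hypotheses, key=lambda ws: ws[1])

def combine_ranks(rankings):
    """Fuse rankings from several KSs by average rank (one simple scheme).

    Words missing from a KS's ranking are penalized with the worst rank.
    """
    words = set().union(*(set(r) for r in rankings))
    avg = {
        w: sum(r.index(w) if w in r else len(r) for r in rankings) / len(rankings)
        for w in words
    }
    return sorted(words, key=lambda w: avg[w])
```

For example, an observed two-syllable pattern of [100, 75] ms falls within 30 ms of the stored pattern for "system", so that word is hypothesized; "recognition" and "speech" are rejected by the syllable-count constraint alone. In the paper's terms, combining the outputs of several such KSs (temporal, intensity, stress) is what reduces the average rank of the correct word.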