Using prosodic and conversational features for high-performance speaker recognition: report from JHU WS'02

While there has been a long tradition of research seeking to use prosodic features, especially pitch, in speaker recognition systems, results have generally been disappointing when such features are used in isolation and only modest improvements have been seen when used in conjunction with traditional cepstral GMM systems. In contrast, we report here on work from the JHU 2002 Summer Workshop exploring a range of prosodic features, using as testbed the 2001 NIST Extended Data task. We examined a variety of modeling techniques, such as n-gram models of turn-level prosodic features and simple vectors of summary statistics per conversation side scored by k/sup th/ nearest-neighbor classifiers. We found that purely prosodic models were able to achieve equal error rates of under 10%, and yielded significant gains when combined with more traditional systems. We also report on exploratory work on "conversational" features, capturing properties of the interaction across conversation sides, such as turn-taking patterns.

[1]  Gökhan Tür,et al.  Prosody-based automatic segmentation of speech into sentences and topics , 2000, Speech Commun..

[2]  B. Atal Automatic Speaker Recognition Based on Pitch Contours , 1969 .

[3]  Michael J. Carey,et al.  Robust prosodic features for speaker identification , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[4]  Douglas A. Reynolds,et al.  The SuperSID project: exploiting high-level information for high-accuracy speaker recognition , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[5]  Elizabeth Shriberg,et al.  Using prosodic and lexical information for speaker identification , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[6]  Larry P. Heck,et al.  Modeling dynamic prosodic variation for speaker verification , 1998, ICSLP.

[7]  Larry P. Heck,et al.  A lognormal tied mixture model of pitch for prosody based speaker recognition , 1997, EUROSPEECH.

[8]  Douglas A. Reynolds,et al.  Modeling prosodic dynamics for speaker recognition , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[9]  Douglas A. Reynolds,et al.  Speaker Verification Using Adapted Gaussian Mixture Models , 2000, Digit. Signal Process..