Modeling dynamic prosodic variation for speaker verification

Statistics of frame-level pitch have recently been used in speaker recognition systems with good results [1, 2, 3]. Although they convey useful long-term information about a speaker’s distribution of f0 values, such statistics fail to capture the local dynamics in intonation that characterize an individual’s speaking style. In this work, we take a first step toward capturing such suprasegmental patterns for automatic speaker verification. Specifically, we model the speaker’s f0 movements by fitting a piecewise linear model to the f0 track to obtain a stylized f0 contour. Parameters of the model are then used as statistical features for speaker verification. We report results on the 1998 NIST speaker verification evaluation. Prosody modeling improves the verification performance of a cepstrum-based Gaussian mixture model system (as measured by a task-specific Bayes risk) by 10%.
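To make the stylization step concrete, the following is a minimal sketch of piecewise linear f0 modeling. It is not the paper's actual algorithm (which may choose breakpoints differently): here we simply assume each contiguous voiced run (f0 > 0) forms one segment, fit a least-squares line per segment, and emit the fit parameters (slope, mean f0, duration) as candidate prosodic features. The function names and the minimum-length threshold are illustrative choices, not from the source.

```python
import numpy as np

def voiced_segments(f0):
    """Return (start, end) index pairs of contiguous voiced runs (f0 > 0).

    Padding with False on both sides guarantees every run has a rising
    and a falling edge, so the edge indices pair up cleanly.
    """
    voiced = np.concatenate(([False], f0 > 0, [False]))
    edges = np.flatnonzero(np.diff(voiced.astype(int)))
    return list(zip(edges[::2], edges[1::2]))

def stylize_f0(f0, min_len=5):
    """Fit a line to each voiced segment of an f0 track.

    Returns an array with one row per segment:
    (slope in Hz/frame, mean f0 in Hz, duration in frames).
    Segments shorter than min_len frames are skipped as unreliable.
    """
    feats = []
    for s, e in voiced_segments(f0):
        if e - s < min_len:
            continue
        t = np.arange(e - s)
        slope, _intercept = np.polyfit(t, f0[s:e], deg=1)
        feats.append((slope, f0[s:e].mean(), e - s))
    return np.array(feats)

# Toy f0 track: a rising voiced segment, an unvoiced gap, a falling segment.
f0 = np.concatenate([
    100 + 2.0 * np.arange(20),   # rising at +2 Hz/frame
    np.zeros(10),                # unvoiced frames (f0 = 0)
    180 - 1.5 * np.arange(20),   # falling at -1.5 Hz/frame
])
feats = stylize_f0(f0)
# feats has two rows; the recovered slopes are ~+2.0 and ~-1.5 Hz/frame.
```

Segment-level parameters like these can then be accumulated over an utterance into the statistical features the abstract refers to, e.g. distributions of segment slopes and durations.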