Prosodic speaker verification using subspace multinomial models with intersession compensation

We propose a novel approach to modeling prosodic features. Inspired by the Joint Factor Analysis (JFA) model, our model is based on the same idea of introducing a subspace of model parameters. However, the underlying Gaussian mixture distribution of JFA is replaced by a multinomial distribution, so that the model describes sequences of discrete units rather than continuous features. In this work, we use the subspace model as a feature extractor for support vector machines (SVMs), similar to the recently proposed use of JFA in the total variability space. We show that high-dimensional count vectors can be reduced to a low-dimensional representation while keeping system performance stable. With additional intersession compensation, we achieve a 30% relative improvement over the baseline system and reach an equal error rate of 8.8% on the NIST 2006 SRE dataset.

Index Terms: speaker verification, prosody, JFA, multinomial model
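To make the idea of a multinomial whose parameters live in a low-dimensional subspace more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a softmax parameterization in which the class log-probabilities are `m + T @ w` up to normalization; the names `m`, `T`, `w`, the synthetic data, and the plain gradient-ascent estimator are all assumptions introduced here for illustration. The point estimate `w_hat` plays the role of the low-dimensional feature vector that could be passed to an SVM.

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log of softmax(z).
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def log_likelihood(counts, m, T, w):
    # Multinomial log-likelihood (without the constant multinomial coefficient)
    # of a count vector under class probabilities softmax(m + T w).
    return float(counts @ log_softmax(m + T @ w))

def estimate_w(counts, m, T, steps=200, lr=0.5):
    # Gradient ascent on the per-count log-likelihood, starting from w = 0.
    # The gradient w.r.t. w is T^T (counts - n * phi), with n the total count.
    w = np.zeros(T.shape[1])
    n = counts.sum()
    for _ in range(steps):
        phi = np.exp(log_softmax(m + T @ w))
        w += lr * (T.T @ (counts - n * phi)) / n
    return w

# Synthetic example (all sizes and values are hypothetical).
rng = np.random.default_rng(0)
num_classes, subspace_dim = 50, 3
m = rng.normal(size=num_classes)                        # shared offset
T = 0.1 * rng.normal(size=(num_classes, subspace_dim))  # subspace matrix
w_true = rng.normal(size=subspace_dim)
phi_true = np.exp(log_softmax(m + T @ w_true))
counts = rng.multinomial(500, phi_true)                 # one "utterance" of discrete units

w_hat = estimate_w(counts, m, T)
ll_start = log_likelihood(counts, m, T, np.zeros(subspace_dim))
ll_final = log_likelihood(counts, m, T, w_hat)
```

In this sketch a 50-dimensional count vector is summarized by a 3-dimensional `w_hat`. In practice `m` and `T` would themselves be trained on a background set, which is omitted here.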
