JFA-based front ends for speaker recognition

We discuss the limitations of the i-vector representation of speech segments in speaker recognition and explain how Joint Factor Analysis (JFA) can serve as an alternative feature extractor in several ways. Building on the work of Zhao and Dong, we implemented a variational Bayes treatment of JFA that accommodates adaptation of universal background models (UBMs) in a natural way. This allows us to experiment with several types of features for speaker recognition: speaker factors and diagonal factors in addition to i-vectors, each extracted with and without UBM adaptation. In text-independent speaker verification experiments on NIST data, extracting i-vectors with UBM adaptation led to a 10% reduction in equal error rate, although performance did not improve consistently over the whole DET curve. We achieved a further 10% reduction (with a similar inconsistency) by using speaker factors extracted with UBM adaptation as features. In text-dependent speaker recognition experiments on RSR2015 data, we achieved very good performance using as a feature extractor a JFA model with diagonal factors but no speaker factors. Contrary to standard practice, this JFA model was configured to model speaker-phrase combinations (rather than speakers), and it was trained on utterances of very short duration (rather than whole recording sessions). We also present a variant of the length normalization trick, inspired by uncertainty propagation, which leads to substantial gains in performance over the whole DET curve.
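For context on the last point, the baseline that such a variant departs from is standard length normalization, which rescales each feature vector to unit Euclidean norm before it reaches the back end. The following is a minimal sketch in plain Python, assuming vectors are lists of floats; the function names are illustrative, and this is not the uncertainty-propagation variant described in the paper, whose details are not given in this abstract.

```python
import math


def length_normalize(vec):
    """Rescale a feature vector to unit Euclidean length.

    This is the standard length normalization step, shown only as a
    baseline; it is NOT the paper's uncertainty-propagation variant.
    """
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_score(a, b):
    """Inner product of two length-normalized vectors (cosine similarity)."""
    a_n = length_normalize(a)
    b_n = length_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


# Toy two-dimensional "feature vector", used purely for illustration.
normalized = length_normalize([3.0, 4.0])
unit_norm = math.sqrt(sum(x * x for x in normalized))
```

After normalization every vector lies on the unit sphere, so the inner product of two normalized vectors equals their cosine similarity regardless of the original vector lengths.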

[1] Kai Feng, et al. The subspace Gaussian mixture model - A structured model for speech recognition, 2011, Comput. Speech Lang.

[2] Patrick Kenny, et al. Joint Factor Analysis of Speaker and Session Variability: Theory and Algorithms, 2006.

[3] Patrick Kenny, et al. New MAP estimators for speaker recognition, 2003, INTERSPEECH.

[4] Bin Ma, et al. Phonetically-constrained PLDA modeling for text-dependent speaker verification with multiple short utterances, 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[5] Hagai Aronowitz. Speaker recognition using kernel-PCA and intersession variability modeling, 2007, INTERSPEECH.

[6] Yuan Dong, et al. Variational Bayesian Joint Factor Analysis Models for Speaker Verification, 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[7] Patrick Kenny, et al. Front-End Factor Analysis for Speaker Verification, 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[8] Yun Lei, et al. A noise robust i-vector extractor using vector taylor series for speaker recognition, 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[9] David A. van Leeuwen, et al. Fusion of Heterogeneous Speaker Recognition Systems in the STBU Submission for the NIST Speaker Recognition Evaluation 2006, 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[10] Themos Stafylakis, et al. Joint Factor Analysis for Text-Dependent Speaker Verification, 2014, Odyssey.

[11] Patrick Kenny, et al. Bayesian Speaker Verification with Heavy-Tailed Priors, 2010, Odyssey.

[12] Patrick Kenny, et al. A Study of Interspeaker Variability in Speaker Verification, 2008, IEEE Transactions on Audio, Speech, and Language Processing.

[13] Patrick Kenny, et al. Eigenvoice modeling with sparse training data, 2005, IEEE Transactions on Speech and Audio Processing.

[14] Oren Barkan, et al. On leveraging conversational data for building a text dependent speaker verification system, 2013, INTERSPEECH.

[15] Bin Ma, et al. The RSR2015: Database for Text-Dependent Speaker Verification using Multiple Pass-Phrases, 2012, Interspeech 2012.

[16] Sridha Sridharan, et al. Modelling session variability in text-independent speaker verification, 2005, INTERSPEECH.

[17] P. Kenny, et al. I-Vector / PLDA Variants for Text-Dependent Speaker Recognition, 2013.

[18] Themos Stafylakis, et al. Text-dependent speaker recognition using PLDA with uncertainty propagation, 2013, INTERSPEECH.

[19] Themos Stafylakis, et al. PLDA for speaker verification with utterances of arbitrary duration, 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[20] Patrick Kenny, et al. The role of speaker factors in the NIST extended data task, 2008, Odyssey.