JFA modeling with left-to-right structure and a new backend for text-dependent speaker recognition

This paper introduces a new formulation of Joint Factor Analysis (JFA) for text-dependent speaker recognition based on left-to-right modeling with tied-mixture HMMs. It accommodates many different ways of extracting multiple features to characterize speakers: features may or may not be HMM state-dependent, they may be modeled with subspace or factorial priors, and these priors may be imputed from text-dependent or text-independent background data. We feed these features to a new, trainable classifier for text-dependent speaker recognition in a manner broadly analogous to the i-vector/PLDA cascade in text-independent speaker recognition. We have evaluated this approach on a challenging proprietary dataset consisting of telephone recordings of short English and Urdu pass-phrases collected in Pakistan. By fusing results obtained with multiple front ends, equal error rates of around 2% are achievable.
