JFA for speaker recognition with random digit strings

In this paper, we examine the use of Joint Factor Analysis methods on RSR2015 part III (digits), [1]. A tied-mixture HMM is used for segmentation of the utterances into digits, while Joint Factor Analysis and a trainable backend are deployed for feature extraction and LLR calculation, respectively. A novel approach for digit-dependent fusion of UBMcomponent log-likelihood ratios is introduced, yielding the best results so far. The fusion of 5 different JFA features gives an equal-error rate of 3.6%, compared to 6.3% attained by the a baseline GMM-UBM model with score normalization. JFA for feature extraction JFA vs. i-vectors • The text-independent paradigm of i-vector/PLDA has not been successful in text-dependent speakerrecognition. The speaker-phrase variability is hard to be confined into a low-dimensional subspace. • JFA offers the flexibility of confining the channel effects in a subspace while allowing the speaker-phrace factors to lie on the supervector space, [2]. Main JFA equation S = m + Ux + V y + Dz (1) • The hidden variable x varies from one recording to another and is intended to model channel effects. • In text-independent speaker recognition, the term Dz is usually dropped and speakers are characterized by the low-dimensional vector y. Here, we extract either z or y features, [3]. JFA on utterances segmented into digits • JFA can be extended to utterances that are segmented into HMM states (digits). • Features can be global (digit-independent) or local (digit-dependent), supervectors-sized (z-vectors) or subspace (y-vectors). Segmentation and Baum-Welch stats Tied-Mixture HMM • Train a UBMand use its means and covariance matrices as codebook for a Tied-Mixture HMM (TMM) • The TMM has a single Gaussian codebook and digitdependent weights. • Very efficient for training and evaluating (Viterbi algorithm). • We use it also for extracting Baum-Welch stats for local features instead of the UBM. Training and evaluating the system Training the JFA and backend • Train a JFA model using both local and global features, z or y-vectors. (Several combinations are possible.) • Extract z or y-vectors, project them onto the unitsphere). • Train a Joint-Density Backend per feature. Evaluating the model • Apply Viterbi segmentation, extract z or y-vectors and use the JDB to calculate LLRs for each trial. • Apply score normalization and fuse score-normalized LLRs coming from multiple features. Joint-Density Backend An Alternative to PLDA • We model the joint-distribution of pairs of enrollment and test vectors under the same speaker hypothesis, [4]. • We use ”target” trials from the training set t = [ye , y T t ] T . • We estimate mean and covariance matrix (C). Assuming zero mean, C is as follows:

[1]  Themos Stafylakis,et al.  Joint Factor Analysis for Text-Dependent Speaker Verification , 2014, Odyssey.

[2]  Themos Stafylakis,et al.  JFA modeling with left-to-right structure and a new backend for text-dependent speaker recognition , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3]  Patrick Kenny,et al.  Joint Factor Analysis Versus Eigenchannels in Speaker Recognition , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[4]  Oren Barkan,et al.  On leveraging conversational data for building a text dependent speaker verification system , 2013, INTERSPEECH.

[5]  Themos Stafylakis,et al.  JFA-based front ends for speaker recognition , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[6]  Haizhou Li,et al.  An overview of text-independent speaker recognition: From features to supervectors , 2010, Speech Commun..

[7]  Bin Ma,et al.  Phonetically-constrained PLDA modeling for text-dependent speaker verification with multiple short utterances , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[8]  Themos Stafylakis,et al.  Text-dependent speaker recognition using PLDA with uncertainty propagation , 2013, INTERSPEECH.

[9]  Hagai Aronowitz,et al.  Domain adaptation for text dependent speaker verification , 2014, INTERSPEECH.

[10]  Sergey Novoselov,et al.  Text-dependent GMM-JFA system for password based speaker verification , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11]  Eduardo Lleida,et al.  Factor analysis with sampling methods for text dependent speaker recognition , 2014, INTERSPEECH.

[12]  Patrick Kenny,et al.  Front-End Factor Analysis for Speaker Verification , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[13]  Matthieu Hébert,et al.  Text-Dependent Speaker Recognition , 2008 .

[14]  LiHaizhou,et al.  An overview of text-independent speaker recognition , 2010 .

[15]  Bin Ma,et al.  Text-dependent speaker verification: Classifiers, databases and RSR2015 , 2014, Speech Commun..

[16]  Pietro Laface,et al.  Generative pairwise models for speaker recognition , 2014, Odyssey.

[17]  Patrick Kenny,et al.  Joint Factor Analysis of Speaker and Session Variability: Theory and Algorithms , 2006 .