In this paper, we examine the use of Joint Factor Analysis (JFA) methods on RSR2015 part III (digits) [1]. A tied-mixture HMM is used to segment the utterances into digits, while Joint Factor Analysis and a trainable backend are deployed for feature extraction and LLR calculation, respectively. A novel approach for digit-dependent fusion of UBM-component log-likelihood ratios is introduced, yielding the best results so far. The fusion of 5 different JFA features gives an equal-error rate of 3.6%, compared to 6.3% attained by a baseline GMM-UBM model with score normalization.

JFA for feature extraction

JFA vs. i-vectors
• The text-independent i-vector/PLDA paradigm has not been successful in text-dependent speaker recognition: the speaker-phrase variability is hard to confine to a low-dimensional subspace.
• JFA offers the flexibility of confining the channel effects to a subspace while allowing the speaker-phrase factors to lie in the supervector space [2].

Main JFA equation

S = m + Ux + Vy + Dz   (1)

• The hidden variable x varies from one recording to another and is intended to model channel effects.
• In text-independent speaker recognition, the term Dz is usually dropped and speakers are characterized by the low-dimensional vector y. Here, we extract either z or y features [3].

JFA on utterances segmented into digits
• JFA can be extended to utterances that are segmented into HMM states (digits).
• Features can be global (digit-independent) or local (digit-dependent), supervector-sized (z-vectors) or subspace-sized (y-vectors).

Segmentation and Baum-Welch stats

Tied-Mixture HMM
• Train a UBM and use its means and covariance matrices as the codebook for a Tied-Mixture HMM (TMM).
• The TMM has a single Gaussian codebook and digit-dependent weights.
• It is very efficient to train and evaluate (Viterbi algorithm).
• We also use it, instead of the UBM, to extract Baum-Welch statistics for local features.
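The supervector decomposition in equation (1) can be sketched numerically. This is a minimal illustration with random placeholder matrices and assumed dimensions (CF, R_x, R_y are illustrative, not the paper's configuration); a real system would learn U, V, D from data via EM.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from the paper):
# supervector dim CF, channel subspace rank R_x, speaker subspace rank R_y
CF, R_x, R_y = 1000, 50, 100

m = rng.standard_normal(CF)          # UBM mean supervector
U = rng.standard_normal((CF, R_x))   # channel subspace (eigenchannels)
V = rng.standard_normal((CF, R_y))   # speaker subspace (eigenvoices)
D = np.abs(rng.standard_normal(CF))  # diagonal residual term (stored as a vector)

x = rng.standard_normal(R_x)         # channel factors: vary per recording
y = rng.standard_normal(R_y)         # speaker factors: low-dimensional feature
z = rng.standard_normal(CF)          # residual speaker factors: supervector-sized feature

# Main JFA equation: S = m + Ux + Vy + Dz
S = m + U @ x + V @ y + D * z

# As in the training pipeline, extracted y- or z-vectors are
# projected onto the unit sphere before the backend.
y_feat = y / np.linalg.norm(y)
```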
Training and evaluating the system

Training the JFA and backend
• Train a JFA model using both local and global features, z- or y-vectors. (Several combinations are possible.)
• Extract z- or y-vectors and project them onto the unit sphere.
• Train a Joint-Density Backend (JDB) per feature.

Evaluating the model
• Apply Viterbi segmentation, extract z- or y-vectors, and use the JDB to calculate LLRs for each trial.
• Apply score normalization and fuse the score-normalized LLRs coming from multiple features.

Joint-Density Backend

An alternative to PLDA
• We model the joint distribution of pairs of enrollment and test vectors under the same-speaker hypothesis [4].
• We use "target" trials from the training set: t = [y_e^T, y_t^T]^T.
• We estimate the mean and the covariance matrix C of t from target trials, assuming zero mean.
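The JDB scoring step above can be sketched as follows. This is a hedged illustration, not the paper's trained backend: the block structure of the target covariance (shared between-speaker block B on the off-diagonals, B + W on the diagonals) and the synthetic B, W matrices are assumptions chosen to make the example self-contained.

```python
import numpy as np

def log_gauss(t, C):
    """Zero-mean Gaussian log-density log N(t; 0, C)."""
    d = len(t)
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + t @ np.linalg.solve(C, t))

rng = np.random.default_rng(1)
d = 20  # y-vector dimension (illustrative)

# Stack enrollment and test vectors: t = [y_e^T, y_t^T]^T
y_e = rng.standard_normal(d)
y_t = rng.standard_normal(d)
t = np.concatenate([y_e, y_t])

# Synthetic covariances (assumed): B = between-speaker, W = within-speaker.
A = rng.standard_normal((d, d))
B = A @ A.T
W = np.eye(d)

# Target hypothesis: enrollment and test share the speaker component,
# so the off-diagonal blocks of C are B. Non-target: independent halves.
C_tar = np.block([[B + W, B], [B, B + W]])
C_non = np.block([[B + W, np.zeros((d, d))], [np.zeros((d, d)), B + W]])

# LLR for the trial: same-speaker vs different-speaker joint densities.
llr = log_gauss(t, C_tar) - log_gauss(t, C_non)
```

In a real system C would be estimated from target trials in the training set, and the resulting LLRs would then be score-normalized and fused across features.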
[1] Themos Stafylakis et al., "Joint Factor Analysis for Text-Dependent Speaker Verification," Odyssey, 2014.
[2] Themos Stafylakis et al., "JFA modeling with left-to-right structure and a new backend for text-dependent speaker recognition," 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.
[3] Patrick Kenny et al., "Joint Factor Analysis Versus Eigenchannels in Speaker Recognition," IEEE Transactions on Audio, Speech, and Language Processing, 2007.
[4] Oren Barkan et al., "On leveraging conversational data for building a text dependent speaker verification system," INTERSPEECH, 2013.
[5] Themos Stafylakis et al., "JFA-based front ends for speaker recognition," 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.
[6] Haizhou Li et al., "An overview of text-independent speaker recognition: From features to supervectors," Speech Commun., 2010.
[7] Bin Ma et al., "Phonetically-constrained PLDA modeling for text-dependent speaker verification with multiple short utterances," 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.
[8] Themos Stafylakis et al., "Text-dependent speaker recognition using PLDA with uncertainty propagation," INTERSPEECH, 2013.
[9] Hagai Aronowitz et al., "Domain adaptation for text dependent speaker verification," INTERSPEECH, 2014.
[10] Sergey Novoselov et al., "Text-dependent GMM-JFA system for password based speaker verification," 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.
[11] Eduardo Lleida et al., "Factor analysis with sampling methods for text dependent speaker recognition," INTERSPEECH, 2014.
[12] Patrick Kenny et al., "Front-End Factor Analysis for Speaker Verification," IEEE Transactions on Audio, Speech, and Language Processing, 2011.
[13] Matthieu Hébert et al., "Text-Dependent Speaker Recognition," 2008.
[14] Haizhou Li et al., "An overview of text-independent speaker recognition," 2010.
[15] Bin Ma et al., "Text-dependent speaker verification: Classifiers, databases and RSR2015," Speech Commun., 2014.
[16] Pietro Laface et al., "Generative pairwise models for speaker recognition," Odyssey, 2014.
[17] Patrick Kenny et al., "Joint Factor Analysis of Speaker and Session Variability: Theory and Algorithms," 2006.