Paper information - JFA for speaker recognition with random digit strings

JFA for speaker recognition with random digit strings

In this paper, we examine the use of Joint Factor Analysis (JFA) methods on RSR2015 Part III (digits) [1]. A tied-mixture HMM is used to segment the utterances into digits, while Joint Factor Analysis and a trainable backend are deployed for feature extraction and LLR calculation, respectively. A novel approach for digit-dependent fusion of UBM-component log-likelihood ratios is introduced, yielding the best results so far. The fusion of 5 different JFA features gives an equal-error rate of 3.6%, compared to 6.3% attained by a baseline GMM-UBM model with score normalization.

JFA for feature extraction

JFA vs. i-vectors
• The text-independent paradigm of i-vector/PLDA has not been successful in text-dependent speaker recognition. Speaker-phrase variability is hard to confine to a low-dimensional subspace.
• JFA offers the flexibility of confining the channel effects to a subspace while allowing the speaker-phrase factors to lie in the supervector space [2].

Main JFA equation
S = m + Ux + Vy + Dz    (1)
• The hidden variable x varies from one recording to another and is intended to model channel effects.
• In text-independent speaker recognition, the term Dz is usually dropped and speakers are characterized by the low-dimensional vector y. Here, we extract either z or y features [3].

JFA on utterances segmented into digits
• JFA can be extended to utterances that are segmented into HMM states (digits).
• Features can be global (digit-independent) or local (digit-dependent), and supervector-sized (z-vectors) or subspace (y-vectors).

Segmentation and Baum-Welch stats

Tied-Mixture HMM
• Train a UBM and use its means and covariance matrices as the codebook for a Tied-Mixture HMM (TMM).
• The TMM has a single Gaussian codebook and digit-dependent weights.
• It is very efficient to train and evaluate (Viterbi algorithm).
• We also use it, instead of the UBM, to extract Baum-Welch stats for local features.
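The tied-mixture idea above can be illustrated with a short sketch: every digit shares one Gaussian codebook (the UBM means and variances) and differs only in its mixture weights, and a left-to-right Viterbi pass assigns frames to digits. Several simplifications here are assumptions for illustration, not details from the source: one HMM state per digit, diagonal covariances, no transition probabilities, and a known digit sequence. The function names `tmm_log_emissions` and `forced_align` are hypothetical.

```python
import numpy as np

def tmm_log_emissions(frames, means, variances, digit_weights):
    """Log-likelihood of each frame under each digit of a tied-mixture model.

    frames:        (T, F) feature vectors
    means:         (C, F) shared codebook means (from the UBM)
    variances:     (C, F) shared diagonal variances (assumption: diagonal)
    digit_weights: (D, C) digit-dependent mixture weights, rows sum to 1
    Returns a (T, D) array of log p(frame_t | digit_d).
    """
    diff = frames[:, None, :] - means[None, :, :]                # (T, C, F)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2 * np.pi * variances), axis=1))  # (T, C)
    # log sum_c w[d, c] * N(o_t; mu_c, var_c), computed stably
    joint = log_gauss[:, None, :] + np.log(digit_weights)[None, :, :]     # (T, D, C)
    mx = joint.max(axis=2, keepdims=True)
    return (mx + np.log(np.exp(joint - mx).sum(axis=2, keepdims=True)))[:, :, 0]

def forced_align(log_emis, digit_sequence):
    """Left-to-right Viterbi alignment of frames to a known digit sequence.

    Simplification (assumption): one state per digit and no transition
    penalties. Requires at least as many frames as digits.
    Returns, for each frame, the position in digit_sequence it is assigned to.
    """
    e = log_emis[:, digit_sequence]                              # (T, S)
    T, S = e.shape
    delta = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    delta[0, 0] = e[0, 0]                                        # must start in the first digit
    for t in range(1, T):
        for s in range(S):
            stay = delta[t - 1, s]
            move = delta[t - 1, s - 1] if s > 0 else -np.inf
            if move > stay:
                delta[t, s], back[t, s] = move + e[t, s], s - 1
            else:
                delta[t, s], back[t, s] = stay + e[t, s], s
    path = [S - 1]                                               # must end in the last digit
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

Once frames are assigned to digits, the same codebook posteriors could be accumulated per digit to obtain the local Baum-Welch statistics mentioned above; that step is omitted here.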
Training and evaluating the system

Training the JFA and backend
• Train a JFA model using both local and global features, z- or y-vectors. (Several combinations are possible.)
• Extract z- or y-vectors and project them onto the unit sphere.
• Train a Joint-Density Backend (JDB) per feature.

Evaluating the model
• Apply Viterbi segmentation, extract z- or y-vectors, and use the JDB to calculate LLRs for each trial.
• Apply score normalization and fuse the score-normalized LLRs coming from multiple features.

Joint-Density Backend

An Alternative to PLDA
• We model the joint distribution of pairs of enrollment and test vectors under the same-speaker hypothesis [4].
• We use "target" trials from the training set, t = [y_e^T, y_t^T]^T.
• We estimate the mean and covariance matrix (C). Assuming zero mean, C is as follows:
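As a rough illustration of the Joint-Density Backend, the sketch below stacks enrollment and test vectors from target trials, estimates a zero-mean joint covariance C, and scores a trial by comparing the joint Gaussian against the product of its marginals. Two points are assumptions rather than details from the source: C is taken to be the plain sample second moment of the stacked vectors, and the different-speaker hypothesis is modelled by treating the enrollment and test vectors as independent (the cross-covariance blocks of C set to zero). All function names are hypothetical.

```python
import numpy as np

def length_normalize(x):
    """Project vectors onto the unit sphere (row-wise for 2-D input)."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def train_jdb(enroll, test):
    """Estimate the zero-mean joint covariance from paired target trials.

    enroll, test: (N, d) arrays; row i of each comes from the same speaker.
    Returns C with shape (2d, 2d).
    """
    t = np.hstack([enroll, test])          # each row is t = [y_e^T, y_t^T]^T
    return t.T @ t / len(t)

def _zero_mean_gauss_logpdf(x, cov):
    """Log-density of a zero-mean Gaussian with covariance cov at x."""
    _, logdet = np.linalg.slogdet(cov)
    quad = x @ np.linalg.solve(cov, x)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet + quad)

def jdb_llr(y_e, y_t, C):
    """LLR of same-speaker vs. different-speaker for one trial.

    Same speaker: [y_e, y_t] ~ N(0, C).
    Different speaker (assumption): y_e and y_t are independent, each
    following the corresponding marginal (diagonal block) of C.
    """
    d = len(y_e)
    same = _zero_mean_gauss_logpdf(np.concatenate([y_e, y_t]), C)
    diff = (_zero_mean_gauss_logpdf(y_e, C[:d, :d])
            + _zero_mean_gauss_logpdf(y_t, C[d:, d:]))
    return same - diff
```

In a full system these raw LLRs would still go through score normalization and fusion across features, as listed in the evaluation steps; neither is shown here.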

Themos Stafylakis | Patrick Kenny | Md. Jahangir Alam | Marcel Kockmann

[1] Themos Stafylakis, et al. Joint Factor Analysis for Text-Dependent Speaker Verification, 2014, Odyssey.

[2] Themos Stafylakis, et al. JFA modeling with left-to-right structure and a new backend for text-dependent speaker recognition, 2015, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3] Patrick Kenny, et al. Joint Factor Analysis Versus Eigenchannels in Speaker Recognition, 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[4] Oren Barkan, et al. On leveraging conversational data for building a text dependent speaker verification system, 2013, INTERSPEECH.

[5] Themos Stafylakis, et al. JFA-based front ends for speaker recognition, 2014, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[6] Haizhou Li, et al. An overview of text-independent speaker recognition: From features to supervectors, 2010, Speech Commun.

[7] Bin Ma, et al. Phonetically-constrained PLDA modeling for text-dependent speaker verification with multiple short utterances, 2013, IEEE International Conference on Acoustics, Speech and Signal Processing.

[8] Themos Stafylakis, et al. Text-dependent speaker recognition using PLDA with uncertainty propagation, 2013, INTERSPEECH.

[9] Hagai Aronowitz, et al. Domain adaptation for text dependent speaker verification, 2014, INTERSPEECH.

[10] Sergey Novoselov, et al. Text-dependent GMM-JFA system for password based speaker verification, 2014, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11] Eduardo Lleida, et al. Factor analysis with sampling methods for text dependent speaker recognition, 2014, INTERSPEECH.

[12] Patrick Kenny, et al. Front-End Factor Analysis for Speaker Verification, 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[13] Matthieu Hébert, et al. Text-Dependent Speaker Recognition, 2008.

[14] Haizhou Li, et al. An overview of text-independent speaker recognition, 2010.

[15] Bin Ma, et al. Text-dependent speaker verification: Classifiers, databases and RSR2015, 2014, Speech Commun.

[16] Pietro Laface, et al. Generative pairwise models for speaker recognition, 2014, Odyssey.

[17] Patrick Kenny, et al. Joint Factor Analysis of Speaker and Session Variability: Theory and Algorithms, 2006.