Factor analysis with sampling methods for text dependent speaker recognition

Factor analysis is a method for embedding high-dimensional data into a lower-dimensional factor space. When the data are multimodal, mixtures of factor analyzers (MFA) are used instead; these assume statistically independent samples. In speaker recognition, samples are not independent, because they all depend on the speaker of the utterance. In joint factor analysis (JFA) and i-vectors, the MFA latent factors are tied at different levels; for example, they can be tied across a segment to extract utterance-level information. Tied MFA approaches usually have the drawback that computing the exact posterior of the hidden variables (component responsibilities and latent factors) is infeasible. For JFA, the preferred approximation is to compute the responsibilities with a speaker-independent GMM and keep them fixed for the rest of the process. As a result, the estimated responsibilities for a given sample are independent of the other samples in the utterance and do not take the shared speaker and channel into account. We present a novel approximation that jointly estimates responsibilities and latent factors by sampling the latent factor space. The model differs from previous ones in how the hidden variables and parameters are estimated and in how the likelihood is evaluated. The approach was tested on the RSR2015 database for text-dependent speaker recognition.
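The contrast between fixed and jointly estimated responsibilities can be illustrated with a small NumPy sketch. It is not the algorithm of the paper: it assumes a tied MFA with diagonal covariances in which frame x_t from component c is distributed as N(m_c + V_c y, Sigma_c), with a single latent factor y ~ N(0, I) shared by every frame of the utterance, and it uses plain Monte Carlo sampling from the prior as a stand-in for whatever sampling scheme the authors use. All function and variable names are hypothetical.

```python
import numpy as np

def log_gauss_diag(x, mean, var):
    """Log N(x_t; mean_c, diag(var_c)) for all frames and components -> (T, C)."""
    diff = x[:, None, :] - mean[None, :, :]
    log_norm = np.sum(np.log(2.0 * np.pi * var), axis=1)
    return -0.5 * (log_norm[None, :] + np.sum(diff ** 2 / var[None, :, :], axis=2))

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    peak = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(peak, axis=axis) + np.log(np.sum(np.exp(a - peak), axis=axis))

def fixed_responsibilities(x, w, m, var):
    """Responsibilities from the speaker-independent GMM alone (y ignored)."""
    log_p = np.log(w)[None, :] + log_gauss_diag(x, m, var)
    return np.exp(log_p - logsumexp(log_p, axis=1)[:, None])

def sampled_responsibilities(x, w, m, V, var, n_samples, rng):
    """Responsibilities and utterance log-likelihood by sampling y from its prior.

    Assumption: each sample y_s is weighted by p(X | y_s), so the frames of one
    utterance influence each other through the shared latent factor.
    """
    n_frames, n_comp = x.shape[0], w.shape[0]
    y = rng.standard_normal((n_samples, V.shape[2]))
    log_joint = np.empty((n_samples, n_frames, n_comp))
    for s in range(n_samples):
        shifted_means = m + V @ y[s]          # (C, D, R) @ (R,) -> (C, D)
        log_joint[s] = np.log(w)[None, :] + log_gauss_diag(x, shifted_means, var)
    log_frame = logsumexp(log_joint, axis=2)  # log p(x_t | y_s), shape (S, T)
    log_utt = np.sum(log_frame, axis=1)       # log p(X | y_s), shape (S,)
    weights = np.exp(log_utt - logsumexp(log_utt, axis=0))
    resp_per_sample = np.exp(log_joint - log_frame[:, :, None])
    resp = np.einsum("s,stc->tc", weights, resp_per_sample)
    log_lik = logsumexp(log_utt, axis=0) - np.log(n_samples)
    return resp, log_lik

# Toy usage with random parameters (illustrative values only).
rng = np.random.default_rng(0)
T, C, D, R = 20, 4, 5, 2
x = rng.standard_normal((T, D))
w = np.full(C, 1.0 / C)
m = rng.standard_normal((C, D))
var = np.ones((C, D))
V = 0.5 * rng.standard_normal((C, D, R))

gamma_fixed = fixed_responsibilities(x, w, m, var)
gamma_sampled, log_lik = sampled_responsibilities(x, w, m, V, var, 200, rng)
```

In this sketch the sampled responsibilities reduce to the fixed ones when every V_c is zero; otherwise each frame's responsibilities depend on the whole utterance through the sample weights, which is the coupling that the fixed approximation discards.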
