Incorporating Duration Information into I-Vector-Based Speaker Recognition Systems

Most of the existing literature on i-vector-based speaker recognition focuses on recognition problems where i-vectors are extracted from speech recordings of sufficient length. The majority of modeling and recognition techniques therefore simply ignore the fact that i-vectors are most likely estimated unreliably when short recordings are used for their computation. Only recently have a number of solutions been proposed in the literature to address the problem of duration variability, all treating the i-vector as a random variable whose posterior distribution can be parameterized by the posterior mean and the posterior covariance. In this setting the covariance matrix serves as a measure of uncertainty that is related to the length of the available recording. In contrast to these solutions, we address the problem of duration variability through weighted statistics. We demonstrate in the paper how established feature transformation techniques regularly used in the area of speaker recognition, such as PCA or WCCN, can be modified to take duration into account. We evaluate our weighting scheme in the scope of the i-vector challenge organized as part of the Odyssey Speaker and Language Recognition Workshop 2014 and achieve a minimum DCF of 0.280, which at the time of writing puts our approach in third place among all the participating institutions.
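
To make the idea of weighted statistics concrete, the following is a minimal sketch of a duration-weighted PCA, not the authors' implementation. It assumes that each i-vector is weighted in proportion to the duration of its recording when the mean and covariance are estimated; the abstract does not state the exact weighting function, and the names `weighted_pca`, `project`, and `durations` are hypothetical.

```python
import numpy as np


def weighted_pca(X, durations, n_components):
    """Duration-weighted PCA (illustrative sketch).

    X          : (n, d) array, one i-vector per row.
    durations  : (n,) positive recording durations, used as weights
                 (assumption: weight proportional to duration).
    Returns the weighted mean, the top principal directions as columns,
    and their eigenvalues in descending order.
    """
    X = np.asarray(X, dtype=float)
    w = np.asarray(durations, dtype=float)
    w = w / w.sum()                          # weights sum to 1

    mean = w @ X                             # weighted mean, shape (d,)
    Xc = X - mean                            # center with the weighted mean
    cov = (Xc * w[:, None]).T @ Xc           # weighted covariance, shape (d, d)

    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    return mean, eigvecs[:, order], eigvals[order]


def project(X, mean, components):
    """Project i-vectors onto the weighted principal directions."""
    return (np.asarray(X, dtype=float) - mean) @ components
```

With equal durations the sketch reduces to ordinary PCA on the biased sample covariance, so long recordings only pull the estimated statistics toward themselves when durations actually differ. A WCCN variant could weight the within-class covariance in the same way, again as an assumption rather than the method described in the paper.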
