Factorized Sub-Space Estimation for Fast and Memory Effective I-vector Extraction

Most state-of-the-art speaker recognition systems use a compact representation of spoken utterances referred to as an i-vector. Since the “standard” i-vector extraction procedure requires large memory structures and is relatively slow, new approaches have recently been proposed that obtain either accurate solutions at the expense of increased computational load, or fast approximate solutions that trade accuracy for lower memory costs. We propose a new approach that is particularly useful for applications that need to minimize their memory requirements. Our solution not only dramatically reduces the memory needed for i-vector extraction, but is also fast and accurate compared to recently proposed approaches. Tested on the female part of the tel-tel extended NIST 2010 evaluation trials, our approach substantially improves performance with respect to the fastest but inaccurate eigen-decomposition approach, while using much less memory than the other methods.
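To make the memory issue concrete, the following is a minimal sketch of the widely used "standard" extraction formula, w = (I + Σ_c N_c T_cᵀ Σ_c⁻¹ T_c)⁻¹ Σ_c T_cᵀ Σ_c⁻¹ f_c, not of the factorized method proposed here. The function names, array layouts, and the model sizes mentioned in the comments are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def extract_ivector(T, sigma_inv, N, f):
    """Standard i-vector point estimate (illustrative sketch).

    T         : (C, F, M) per-component blocks of the total variability matrix
    sigma_inv : (C, F)    inverse diagonal covariances of the UBM components
    N         : (C,)      zero-order statistics of the utterance
    f         : (C, F)    centered first-order statistics of the utterance
    """
    C, F, M = T.shape
    # Per-component terms T_c^T Sigma_c^{-1} T_c. A fast implementation
    # precomputes and stores these C matrices of size M x M, which is the
    # large memory structure the abstract refers to.
    TtSiT = np.einsum("cfm,cf,cfn->cmn", T, sigma_inv, T)
    # Posterior precision: I + sum_c N_c * T_c^T Sigma_c^{-1} T_c
    L = np.eye(M) + np.einsum("c,cmn->mn", N, TtSiT)
    # Linear term: sum_c T_c^T Sigma_c^{-1} f_c
    b = np.einsum("cfm,cf,cf->m", T, sigma_inv, f)
    # Solve L w = b rather than forming the inverse explicitly
    return np.linalg.solve(L, b)


def precomputed_bytes(C, M, bytes_per_float=4):
    """Storage for C symmetric M x M matrices (upper triangle only)."""
    return C * (M * (M + 1) // 2) * bytes_per_float
```

Under the assumed sizes C = 2048 components and M = 400 dimensions, `precomputed_bytes(2048, 400)` gives roughly 657 MB in single precision, which illustrates why memory-constrained applications look for alternatives to storing these terms.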
