Memory-aware i-vector extraction by means of sub-space factorization

Most state-of-the-art speaker recognition systems use i-vectors, a compact representation of spoken utterances. Because the “standard” i-vector extraction procedure requires large memory structures, we recently presented the Factorized Sub-space Estimation (FSE) approach, an efficient technique that dramatically reduces the memory needed for i-vector extraction and is also fast and accurate compared to other proposed approaches. FSE approximates the matrix T, which represents the speaker variability sub-space, by a product of appropriately designed matrices. In this work, we introduce and evaluate a further approximation of the matrices that contribute most to the memory cost of the FSE approach. We show that comparable system accuracy can be obtained using less than half of the memory required by FSE, which corresponds to a memory reduction of more than 60 times with respect to the standard i-vector extraction method.
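
As a rough illustration of why replacing a large matrix with a product of smaller matrices saves memory, the sketch below counts the stored values for a dense matrix and for a generic two-factor approximation. It is not the FSE factorization itself, whose factor structure is not described here. The dimensions (C Gaussian components, F features, M i-vector dimensions) and the inner dimension K are hypothetical values chosen only for the example.

```python
# Illustrative storage count only; this is a generic two-factor product,
# NOT the specific factor structure used by FSE.

def dense_floats(rows: int, cols: int) -> int:
    """Number of values stored for a dense rows x cols matrix."""
    return rows * cols


def factored_floats(rows: int, cols: int, inner: int) -> int:
    """Number of values stored when the matrix is kept as A @ B,
    with A of size rows x inner and B of size inner x cols."""
    return rows * inner + inner * cols


# Hypothetical sizes (assumptions, not taken from the paper).
C, F, M = 2048, 60, 400   # components, feature dimension, i-vector dimension
K = 50                    # hypothetical inner dimension of the factorization

rows = C * F              # T has one F-row block per Gaussian component
full = dense_floats(rows, M)
approx = factored_floats(rows, M, K)
ratio = full / approx

print(f"dense T:    {full} values")
print(f"factored T: {approx} values")
print(f"reduction:  {ratio:.1f}x")
```

The saving grows as the inner dimension shrinks relative to the dimensions of the original matrix, at the price of a coarser approximation. This is the trade-off between memory and accuracy that the abstract refers to.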
