Generalizing I-Vector Estimation for Rapid Speaker Recognition

An i-vector is a compact representation that captures both the speaker and session variability in a spoken utterance. In recent years it has prevailed over other techniques and is now the de facto representation for text-independent speaker recognition. Standard i-vector extraction, however, requires intensive computation at run-time, and reducing this cost would allow i-vectors to be used effectively in more applications. The cost arises mainly from evaluating the posterior covariance matrix when estimating the i-vector. Earlier studies on simplifying the computation of the posterior covariance matrix have had modest success. In this paper, we propose a novel approach to i-vector extraction that does not need to evaluate the full posterior covariance, thereby speeding up run-time extraction. This is achieved by generalizing i-vector estimation in two ways. First, we introduce occupancy reweighting in conjunction with whitening of the Baum–Welch statistics as a preprocessing step. Second, we introduce the so-called subspace-orthogonalizing prior (SOP) to replace the standard Gaussian prior in the i-vector formulation. Experiments conducted on the extended-core task of NIST SRE’10 show that the proposed rapid SOP approach achieves a considerable speed-up over the standard i-vector with comparable equal error rates.
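
For orientation, the run-time cost mentioned above can be illustrated with a minimal NumPy sketch of the conventional i-vector posterior mean, w = (I + TᵀΣ⁻¹NT)⁻¹ TᵀΣ⁻¹F, under a standard Gaussian prior. This is the baseline computation, not the proposed SOP method. The function name, argument names, and array shapes are assumptions chosen for illustration, and diagonal UBM covariances are assumed.

```python
import numpy as np

def extract_ivector(T, sigma, N, F):
    """Posterior mean of the i-vector under a standard normal prior.

    Hypothetical argument layout (an assumption for this sketch):
      T     : (C*D, R) total variability matrix
      sigma : (C*D,)   stacked diagonal UBM covariances
      N     : (C,)     zero-order Baum-Welch statistics (occupancies)
      F     : (C*D,)   centered first-order Baum-Welch statistics
    """
    C = N.shape[0]
    D = T.shape[0] // C
    R = T.shape[1]

    # Repeat each component occupancy across its D feature dimensions.
    n_expanded = np.repeat(N, D)

    # T^T Sigma^{-1}, shape (R, C*D).
    Tt_sinv = (T / sigma[:, None]).T

    # Posterior precision I + T^T Sigma^{-1} N T, shape (R, R).
    # It depends on the utterance through N, so it is rebuilt per utterance.
    precision = np.eye(R) + (Tt_sinv * n_expanded) @ T

    # Solve the linear system instead of forming the inverse explicitly.
    return np.linalg.solve(precision, Tt_sinv @ F)
```

Because the precision matrix depends on the occupancies of each utterance, it has to be formed and solved again for every utterance; this is the step whose cost the fast extraction methods aim to avoid.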
