A Distribution-Free Formulation of the Total Variability Model

The Total Variability Model (TVM) [1] has been widely used in audio signal processing as a framework for capturing differences in feature-space distributions across variable-length sequences by mapping them into a fixed-dimensional representation. Its formulation requires assuming that the source data are distributed according to a Gaussian Mixture Model (GMM). In this paper, we show that it is possible to arrive at the same model formulation without any such assumption about the distribution of the data, by proving asymptotic normality of the statistics used to estimate the model. We highlight connections between TVM and heteroscedastic Principal Component Analysis (PCA), as well as the matrix completion problem, which lead to a computationally efficient formulation of the Maximum Likelihood estimation problem for the model.
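To make the TVM/PPCA connection concrete, the sketch below computes the posterior mean of the latent variable (the i-vector) from per-component sufficient statistics under the standard linear-Gaussian formulation. All dimensions, variable names, and the random inputs are hypothetical placeholders for illustration; they are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: C mixture components, F-dim features,
# R-dim latent variable (i-vector).
C, F, R = 8, 4, 2

# Per-component blocks T_c of the total variability matrix, and
# inverse diagonal covariances Sigma_c^{-1} (identity here for simplicity).
T = rng.standard_normal((C, F, R)) * 0.1
Sigma_inv = np.ones((C, F))

# Sufficient (Baum-Welch) statistics for one utterance:
N = rng.uniform(1.0, 5.0, size=C)      # zeroth-order stats per component
Fstat = rng.standard_normal((C, F))    # centered first-order stats

# Posterior precision L and linear term b of the latent variable w:
#   L = I + sum_c N_c T_c^T Sigma_c^{-1} T_c
#   b = sum_c T_c^T Sigma_c^{-1} F_c
L = np.eye(R)
b = np.zeros(R)
for c in range(C):
    TtS = T[c].T * Sigma_inv[c]        # T_c^T Sigma_c^{-1} (diagonal case)
    L += N[c] * TtS @ T[c]
    b += TtS @ Fstat[c]

# Posterior mean (the i-vector): w = L^{-1} b
w = np.linalg.solve(L, b)
print(w.shape)  # (R,)
```

This is the same posterior-mean computation that arises in probabilistic PCA [13]; the heteroscedastic aspect enters through the component-dependent weights N_c and covariances Sigma_c.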

[1] George Saon et al., "Speaker adaptation of neural network acoustic models using i-vectors," 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2013.

[2] Yun Lei et al., "A novel scheme for speaker recognition using a phonetically-aware deep neural network," 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.

[3] Douglas A. Reynolds et al., "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Transactions on Speech and Audio Processing, 1995.

[4] Vikas Sindhwani et al., "Efficient and Practical Stochastic Subgradient Descent for Nuclear Norm Regularization," ICML, 2012.

[5] Florin Curelaru et al., "Front-End Factor Analysis for Speaker Verification," 2018 International Conference on Communications (COMM), 2018.

[6] Andrew W. Senior et al., "Improving DNN speaker independence with I-vector inputs," 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.

[7] Shrikanth S. Narayanan et al., "Non-Iterative Parameter Estimation for Total Variability Model Using Randomized Singular Value Decomposition," INTERSPEECH, 2016.

[8] Emmanuel J. Candès et al., "Matrix Completion With Noise," Proceedings of the IEEE, 2009.

[9] Emmanuel J. Candès et al., "Exact Matrix Completion via Convex Optimization," Foundations of Computational Mathematics, 2009.

[10] Yun Lei et al., "Application of Convolutional Neural Networks to Language Identification in Noisy Conditions," Odyssey, 2014.

[11] Douglas A. Reynolds et al., "Language Recognition via i-vectors and Dimensionality Reduction," INTERSPEECH, 2011.

[12] Shrikanth S. Narayanan et al., "Classification of cognitive load from speech using an i-vector framework," INTERSPEECH, 2014.

[13] Michael E. Tipping et al., "Probabilistic Principal Component Analysis," 1999.

[14] Lukás Burget et al., "Language Recognition in iVectors Space," INTERSPEECH, 2011.

[15] Douglas A. Reynolds et al., "Speaker Verification Using Adapted Gaussian Mixture Models," Digital Signal Processing, 2000.

[16] Patrick Kenny et al., "Joint Factor Analysis Versus Eigenchannels in Speaker Recognition," IEEE Transactions on Audio, Speech, and Language Processing, 2007.

[17] Daniel Povey et al., "The Kaldi Speech Recognition Toolkit," 2011.

[18] Douglas E. Sturim et al., "Support vector machines using GMM supervectors for speaker verification," IEEE Signal Processing Letters, 2006.

[19] Themos Stafylakis et al., "Deep Neural Networks for extracting Baum-Welch statistics for Speaker Recognition," Odyssey, 2014.