Gaussian Mixture Model Weight Supervector Decomposition and Adaptation

This report proposes a novel approach for Gaussian Mixture Model (GMM) weight decomposition and adaptation. The model yields a new low-dimensional utterance representation, obtained with a simple factor analysis similar to that of the i-vector framework. The approach is evaluated on the Robust Automatic Transcription of Speech (RATS) language identification evaluation corpus, whose speech recordings come from highly degraded communication channels. In our experiments, each utterance is first modeled with the proposed approach, and a Deep Belief Network (DBN) is then used to recognize its language. The results show that the proposed method improves on conventional maximum likelihood weight adaptation. Score-level fusion of the i-vector framework and the proposed method yields absolute and relative improvements of 5% and 17%, respectively.
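For orientation, the conventional maximum likelihood weight adaptation used as the baseline can be sketched as below. This is only an illustrative sketch of the standard EM weight update, not the proposed decomposition: it assumes diagonal covariances, keeps the means and variances of a background GMM fixed, and re-estimates only the mixture weights for one utterance. All function and variable names are hypothetical.

```python
import numpy as np

def log_gaussians_diag(frames, means, variances):
    """Log-density of every frame under every diagonal-covariance Gaussian.

    frames: (T, D), means: (C, D), variances: (C, D) -> result: (T, C)
    """
    dim = frames.shape[1]
    diff = frames[:, None, :] - means[None, :, :]               # (T, C, D)
    mahal = np.sum(diff ** 2 / variances[None, :, :], axis=2)   # (T, C)
    log_det = np.sum(np.log(variances), axis=1)                 # (C,)
    return -0.5 * (dim * np.log(2.0 * np.pi) + log_det[None, :] + mahal)

def ml_weight_adaptation(frames, init_weights, means, variances, n_iter=10):
    """EM re-estimation of the mixture weights only (assumed baseline)."""
    weights = np.asarray(init_weights, dtype=float).copy()
    # Means and variances stay fixed, so the log-densities are computed once.
    log_pdf = log_gaussians_diag(frames, means, variances)
    for _ in range(n_iter):
        # E-step: component posteriors per frame, computed in the log domain.
        log_post = np.log(weights)[None, :] + log_pdf
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: each new weight is the average posterior of its component.
        weights = post.mean(axis=0)
        # Floor and renormalize so no weight collapses to exactly zero.
        weights = np.maximum(weights, 1e-10)
        weights /= weights.sum()
    return weights

# Toy usage: two well-separated components, all frames drawn near the first.
rng = np.random.default_rng(0)
means = np.array([[-5.0, 0.0], [5.0, 0.0]])
variances = np.ones((2, 2))
frames = rng.normal(size=(200, 2)) + means[0]
adapted = ml_weight_adaptation(frames, np.array([0.5, 0.5]), means, variances)
```

Stacking the adapted weights of all components gives the per-utterance weight supervector; in this toy case nearly all of the mass moves to the first component.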
