Latent variable speaker adaptation of Gaussian mixture weights and means

We describe a novel fast speaker adaptation algorithm for large vocabulary speech recognition systems that adapts both the Gaussian means and the mixture weights. The Gaussian means are expressed as a linear combination of eigenvoices estimated with principal component analysis. The non-negative Gaussian mixture weights are expressed as a linear combination of a set of latent vectors estimated with non-negative matrix factorization. Experiments on the Wall Street Journal database show that combining weight and mean adaptation consistently improves performance compared to eigenvoice adaptation alone. Relative word error rate reductions of up to 5.8% were observed with 40 eigenvoices and 40 latent weight vectors. Furthermore, combining weight and mean adaptation outperformed both weight adaptation and mean adaptation on their own, even when either of these uses more latent vectors.
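
The two decompositions described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the toy sizes, the random training data, and the names `nmf` and `adapt` are assumptions made for illustration. The sketch also assumes that a new speaker's means and weights are observed directly and fits them by least squares, whereas a real recognizer would estimate the coefficients from adaptation speech, typically by maximum likelihood, and would normalize the weights per HMM state.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions): 30 training speakers, 8 Gaussians of dimension 3,
# and 4 basis vectors for both the means and the weights.
n_speakers, n_gauss, dim, n_basis = 30, 8, 3, 4

# Synthetic speaker-dependent models: one mean supervector and one
# mixture-weight vector per speaker (each weight row sums to 1).
means = rng.normal(size=(n_speakers, n_gauss * dim))
weights = rng.dirichlet(np.ones(n_gauss), size=n_speakers)

# Eigenvoices: leading principal components of the centred mean supervectors.
mean_sv = means.mean(axis=0)
_, _, vt = np.linalg.svd(means - mean_sv, full_matrices=False)
eigenvoices = vt[:n_basis]  # orthonormal rows, shape (n_basis, n_gauss * dim)


def nmf(v, rank, n_iter=200, eps=1e-12):
    """Factor v ~= h @ w with Euclidean multiplicative updates (Lee and Seung)."""
    h = rng.random((v.shape[0], rank)) + eps
    w = rng.random((rank, v.shape[1])) + eps
    for _ in range(n_iter):
        h *= (v @ w.T) / (h @ w @ w.T + eps)
        w *= (h.T @ v) / (h.T @ h @ w + eps)
    return h, w


# Latent weight vectors: the non-negative basis shared by all speakers.
_, latent_weights = nmf(weights, n_basis)


def adapt(new_means, new_weights, n_iter=200, eps=1e-12):
    """Fit a new speaker inside the eigenvoice and latent-weight subspaces."""
    # Means: orthogonal projection onto the eigenvoice space.
    coeff = eigenvoices @ (new_means - mean_sv)
    adapted_means = mean_sv + coeff @ eigenvoices
    # Weights: non-negative coefficients with the latent vectors held fixed.
    h = np.full(n_basis, 1.0 / n_basis)
    gram = latent_weights @ latent_weights.T
    for _ in range(n_iter):
        h *= (latent_weights @ new_weights) / (gram @ h + eps)
    adapted_weights = h @ latent_weights
    # Renormalize so the adapted weights form a valid mixture.
    return adapted_means, adapted_weights / adapted_weights.sum()


adapted_means, adapted_weights = adapt(means[0], weights[0])
```

Because the multiplicative updates only rescale positive entries, the adapted weights stay non-negative without any explicit constraint, which is why a non-negative factorization is a natural fit for mixture weights, while the unconstrained means can use an ordinary PCA basis.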
