Large vocabulary conversational speech recognition with a subspace constraint on inverse covariance matrices

This paper applies the recently proposed SPAM models for acoustic modeling in a Speaker Adaptive Training (SAT) context on large vocabulary conversational speech databases, including the Switchboard database. SPAM models are Gaussian mixture models in which a subspace constraint is placed on the precision matrices and means (although this paper focuses on the case of unconstrained means). They include diagonal covariance, full covariance, MLLT, and EMLLT models as special cases. Adaptation is carried out with maximum likelihood estimation of the means and feature-space transforms under the SPAM model. This paper shows the first experimental evidence that SPAM models can achieve significant word-error-rate improvements over state-of-the-art diagonal covariance models, even when those diagonal models are given the benefit of choosing the optimal number of Gaussians (according to the Bayesian Information Criterion). This paper is also the first to apply SPAM models in a SAT context. All experiments are performed on the IBM “Superhuman” speech corpus, a challenging and diverse conversational speech test set that includes the Switchboard portion of the 1998 Hub5e evaluation data set.
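The core idea behind SPAM can be sketched in a few lines: each Gaussian's precision (inverse covariance) matrix is constrained to be a linear combination of a small shared basis of symmetric matrices, P_g = Σ_k λ_gk S_k, and the log-likelihood is computed directly from the precision. The basis, weights, and dimensions below are illustrative toy values, not the paper's trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 4, 3  # feature dimension, subspace size (K << d*(d+1)/2 in practice)

# Shared symmetric basis: identity plus random symmetric matrices.
# In SPAM the basis is trained by maximum likelihood; here it is arbitrary.
basis = [np.eye(d)]
for _ in range(K - 1):
    A = rng.normal(size=(d, d))
    basis.append((A + A.T) / 2)
basis = np.stack(basis)  # shape (K, d, d)

def precision(lam, basis):
    """Subspace-constrained precision: P = sum_k lam[k] * S_k."""
    return np.tensordot(lam, basis, axes=1)

def gaussian_loglik(x, mean, P):
    """Log-density of N(mean, P^{-1}), evaluated from the precision matrix."""
    dim = len(mean)
    _, logdet = np.linalg.slogdet(P)
    diff = x - mean
    return 0.5 * (logdet - dim * np.log(2 * np.pi) - diff @ P @ diff)

# Per-Gaussian weights; a dominant identity term keeps this toy P
# positive definite (training enforces this constraint properly).
lam = np.array([5.0, 0.2, 0.2])
P = precision(lam, basis)
assert np.all(np.linalg.eigvalsh(P) > 0)  # valid precision matrix
ll = gaussian_loglik(rng.normal(size=d), np.zeros(d), P)
```

Because all Gaussians share the basis, evaluating many densities reduces to K per-frame quadratic forms plus cheap per-Gaussian dot products, which is the computational advantage over full-covariance models.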
