Investigation of acoustic modeling techniques for LVCSR systems

This paper describes several advanced acoustic modeling techniques used in the 2004 CU-HTK large vocabulary speech recognition systems: Gaussianization for speaker normalization, discriminative cluster adaptive training (CAT), subspace for precision and mean (SPAM) modeling of inverse covariances, and discriminative complexity control. To evaluate performance, acoustic models using these techniques were integrated into a state-of-the-art multi-pass system with sophisticated adaptation, running at 10 times real time. Experimental results are presented for both broadcast news (BN) and conversational telephone speech (CTS) transcription tasks.
