Discriminative cluster adaptive training

Multiple-cluster schemes, such as cluster adaptive training (CAT) or eigenvoice systems, are a popular approach for rapid speaker and environment adaptation. Interpolation weights are used to transform a multiple-cluster, canonical, model to a standard hidden Markov model (HMM) set representative of an individual speaker or acoustic environment. Maximum likelihood training for CAT has previously been investigated. However, in state-of-the-art large vocabulary continuous speech recognition systems, discriminative training is commonly employed. This paper investigates applying discriminative training to multiple-cluster systems. In particular, minimum phone error (MPE) update formulae for CAT systems are derived. In order to use MPE in this case, modifications to the standard MPE smoothing function and the prior distribution associated with MPE training are required. A more complex adaptive training scheme combining both interpolation weights and linear transforms, a structured transform (ST), is also discussed within the MPE training framework. Discriminatively trained CAT and ST systems were evaluated on a state-of-the-art conversational telephone speech task. These multiple-cluster systems were found to outperform both standard and adaptively trained systems

[1]  Yves Normandin,et al.  Hidden Markov models, maximum mutual information estimation, and the speech recognition problem , 1992 .

[2]  Philip C. Woodland,et al.  Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models , 1995, Comput. Speech Lang..

[3]  Mark J. F. Gales Cluster adaptive training of hidden Markov models , 2000, IEEE Trans. Speech Audio Process..

[4]  Mark J. F. Gales,et al.  Adaptive training using structured transforms , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[5]  Daniel Povey,et al.  Minimum Phone Error and I-smoothing for improved discriminative training , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[6]  Biing-Hwang Juang,et al.  Minimum classification error rate methods for speech recognition , 1997, IEEE Trans. Speech Audio Process..

[7]  Henrik Botterweck Very fast adaptation for large vocabulary continuous speech recognition using eigenvoices , 2000, INTERSPEECH.

[8]  Alex Acero,et al.  Spoken Language Processing , 2001 .

[9]  Philip C. Woodland,et al.  Discriminative adaptive training using the MPE criterion , 2003, 2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No.03EX721).

[10]  Richard M. Schwartz,et al.  A compact model for speaker-adaptive training , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[11]  Mark J. F. Gales Cluster adaptive training for speech recognition , 1998, ICSLP.

[12]  Lalit R. Bahl,et al.  Maximum mutual information estimation of hidden Markov model parameters for speech recognition , 1986, ICASSP '86. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[13]  Stephen Cox,et al.  Some statistical issues in the comparison of speech recognition algorithms , 1989, International Conference on Acoustics, Speech, and Signal Processing,.

[14]  William J. Byrne,et al.  Discriminative speaker adaptation with conditional maximum likelihood linear regression , 2001, INTERSPEECH.

[15]  Mark J. F. Gales,et al.  MMI-MAP and MPE-MAP for acoustic model adaptation , 2003, INTERSPEECH.

[16]  Alexander H. Waibel,et al.  On maximum mutual information speaker-adapted training , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[17]  Chin-Hui Lee,et al.  Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains , 1994, IEEE Trans. Speech Audio Process..

[18]  Patrick Nguyen Self-adaptation using eigenvoices for large-vocabulary continuous speech recognition , 2001 .

[19]  Jonathan Le Roux,et al.  Discriminative Training for Large-Vocabulary Speech Recognition Using Minimum Classification Error , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[20]  Mark J. F. Gales,et al.  Multiple-cluster adaptive training schemes , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[21]  Jean-Claude Junqua,et al.  Maximum likelihood eigenspace and MLLR for speech recognition in noisy environments , 1999, EUROSPEECH.

[22]  Daniel Povey,et al.  Large scale discriminative training of hidden Markov models for speech recognition , 2002, Comput. Speech Lang..

[23]  Mark J. F. Gales,et al.  Maximum likelihood linear transformations for HMM-based speech recognition , 1998, Comput. Speech Lang..

[24]  Roland Kuhn,et al.  Rapid speaker adaptation in eigenvoice space , 2000, IEEE Trans. Speech Audio Process..

[25]  Roland Kuhn,et al.  Eigenvoices for speaker adaptation , 1998, ICSLP.