Discriminative speaker adaptation in Persian continuous speech recognition systems

Abstract In this paper, the use of discriminative criteria such as minimum phone error (MPE) and maximum mutual information (MMI) is investigated for discriminative training HMM models for Persian speech recognition system. Discriminative training criteria have been successfully used to train acoustic models, so these criteria are expected to improve the estimation of linear transforms for speaker adaptation. MPE criterion is used to estimate the discriminative linear transforms (DLTs) for mean transforms. Experiments on Farsdat corpus show considerable improvements of discriminative training against ML trained models and MPE training outperforms MMI training on test data. Furthermore, MPE-based DLT reduces the word error rate in comparison to MLLR adaptation.

[1]  Steve J. Young,et al.  MMIE training of large vocabulary recognition systems , 1997, Speech Communication.

[2]  Philip C. Woodland,et al.  Improvements in linear transform based speaker adaptation , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[3]  Philip C. Woodland,et al.  Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models , 1995, Comput. Speech Lang..

[4]  Daniel Povey,et al.  Minimum Phone Error and I-smoothing for improved discriminative training , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[5]  Mark J. F. Gales,et al.  Maximum likelihood linear transformations for HMM-based speech recognition , 1998, Comput. Speech Lang..

[6]  Mark J. F. Gales,et al.  Unsupervised Adaptation With Discriminative Mapping Transforms , 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[7]  Lan Wang,et al.  MPE-based discriminative linear transforms for speaker adaptation , 2008, Comput. Speech Lang..

[8]  Daniel Povey,et al.  Large scale discriminative training of hidden Markov models for speech recognition , 2002, Comput. Speech Lang..