Unsupervised Adaptation With Discriminative Mapping Transforms

The most commonly used approaches to speaker adaptation are based on linear transforms, as these can be robustly estimated from limited adaptation data. Although discriminative criteria yield significant gains when training acoustic models, maximum-likelihood (ML) estimated transforms are still used for unsupervised adaptation. This is because discriminatively trained transforms are highly sensitive to errors in the adaptation supervision hypothesis. This paper describes a new framework for estimating transforms that are discriminative in nature but less sensitive to errors in the hypothesis. A speaker-independent discriminative mapping transform (DMT) is estimated during training, after an ML-estimated speaker-specific transform has been applied for each training speaker. During recognition, an ML speaker-specific transform is found for each test-set speaker and the speaker-independent DMT is then applied. This allows a transform that is discriminative in nature to be estimated indirectly, while requiring only an ML speaker-specific transform to be found during recognition. The DMT technique is evaluated on an English conversational telephone speech task. Experiments showed that using DMT in unsupervised adaptation led to significant gains over both standard ML and discriminatively trained transforms.
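
The recognition-time procedure described above, a speaker-specific ML transform followed by the shared DMT, can be sketched as a composition of two transforms. The sketch below is illustrative only and rests on assumptions not stated in the abstract: that both transforms are affine (a matrix and a bias) and that they act on a Gaussian mean vector. The function names and the toy numbers are hypothetical, and the estimation of either transform is not shown.

```python
# Illustrative sketch only. Assumptions (not from the abstract): both transforms
# are affine maps (A, b) applied to a Gaussian mean; all names are hypothetical.

def mat_vec(A, x):
    """Multiply matrix A (list of rows) by vector x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def apply_affine(A, b, x):
    """Return A x + b."""
    return [y + c for y, c in zip(mat_vec(A, x), b)]

def compose_affine(A2, b2, A1, b1):
    """Return the single affine map equal to applying (A1, b1), then (A2, b2)."""
    cols = list(zip(*A1))
    A = [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in A2]
    b = apply_affine(A2, b2, b1)  # A2 b1 + b2
    return A, b

def adapt_mean(mean, ml_transform, dmt):
    """Apply the speaker-specific ML transform first, then the shared DMT."""
    A_ml, b_ml = ml_transform
    A_d, b_d = dmt
    return apply_affine(A_d, b_d, apply_affine(A_ml, b_ml, mean))

# Toy values: the ML transform would be estimated per test speaker,
# while the DMT is fixed after training and shared by all speakers.
ml_transform = ([[2, 0], [0, 1]], [1, 0])
dmt = ([[1, 1], [0, 1]], [0, -1])
adapted = adapt_mean([1, 2], ml_transform, dmt)
```

Under these assumptions, only `ml_transform` depends on the test speaker and its possibly errorful hypothesis; `dmt` is reused unchanged, and `compose_affine` shows that the pair collapses to one affine map per speaker.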
