Speaker adaptation for continuous density HMMs: a review

This paper reviews some popular speaker adaptation schemes that can be applied to continuous density hidden Markov models (HMMs). These fall into three families: those based on MAP adaptation; linear transforms of model parameters, such as maximum likelihood linear regression (MLLR); and speaker clustering/speaker space methods, such as eigenvoices. The strengths and weaknesses of each family are discussed, along with extensions proposed to improve the basic schemes, which result in a number of hybrid approaches. Several general extensions are also discussed, including methods for improved unsupervised adaptation and discriminative adaptation. There is a brief discussion of speaker normalisation and its relationship to model-based adaptation. The paper also briefly discusses other factors that directly interact with speaker adaptation of HMMs, such as adaptation to the acoustic environment and speaker-specific pronunciation dictionaries.
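
As a rough illustration of how the first two families differ, the sketch below contrasts the standard textbook form of a MAP mean update with a maximum likelihood linear regression style mean transform. It is a minimal sketch, not the paper's implementation: the function names, the prior weight `tau`, and the use of plain Python lists are all assumptions made for illustration, and estimation of the transform `A`, `b` itself is not shown.

```python
def map_adapt_mean(prior_mean, frames, posteriors, tau=10.0):
    """MAP update of a single Gaussian mean (hypothetical helper).

    prior_mean : speaker-independent mean vector
    frames     : adaptation observation vectors
    posteriors : occupation probability of this Gaussian for each frame
    tau        : assumed prior weight; larger values trust the prior more
    """
    dim = len(prior_mean)
    occupancy = sum(posteriors)
    # Posterior-weighted sum of the adaptation frames, per dimension.
    weighted_sum = [
        sum(g * o[d] for g, o in zip(posteriors, frames)) for d in range(dim)
    ]
    # Interpolate between the prior mean and the adaptation data.
    return [
        (tau * prior_mean[d] + weighted_sum[d]) / (tau + occupancy)
        for d in range(dim)
    ]


def linear_transform_mean(mean, A, b):
    """Apply an affine transform A * mean + b to a mean vector.

    In a linear-transform scheme the same (A, b) is shared by many
    Gaussians, so components with no adaptation data still move.
    """
    return [
        sum(A[i][j] * mean[j] for j in range(len(mean))) + b[i]
        for i in range(len(b))
    ]
```

With no adaptation data the MAP update returns the prior mean unchanged, and only Gaussians that are actually observed move towards the speaker's data. The shared transform updates every Gaussian it is tied to, which is the usual motivation for using linear transforms when adaptation data is limited.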
