Structural speaker adaptation using maximum a posteriori approach and a Gaussian distributions merging technique

The aim of speaker adaptation techniques is to enhance speaker-independent acoustic models so that their recognition accuracy comes as close as possible to that of speaker-dependent models. Recently, a technique based on a hierarchical structure and the maximum a posteriori (MAP) criterion, called SMAP, was proposed (Shinoda and Lee, Proc. IEEE ICASSP, 1998). As in SMAP, we assume that the acoustic model parameters are organized in a tree containing all the Gaussian distributions. Each node of that tree represents a cluster of Gaussian distributions sharing a common affine transformation that models the mismatch between training and test conditions. To estimate this affine transformation, we propose a new technique based on merging Gaussian distributions and standard MAP adaptation. The technique is very fast and provides good unsupervised adaptation of both means and variances even with a small amount of adaptation data. This adaptation strategy yields a significant performance improvement on a large-vocabulary speech recognition task, both alone and in combination with MLLR (maximum likelihood linear regression) adaptation.
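
As a rough illustration of the two ingredients mentioned above, the sketch below (in Python; the function names merge_gaussians and map_adapt_mean, the prior weight tau, and the toy values are illustrative assumptions, not the paper's implementation) shows how a cluster of diagonal-covariance Gaussians can be collapsed into a single Gaussian by moment matching, and how a merged mean can then be interpolated with adaptation statistics using the standard MAP update. The per-node affine transformation and its propagation down the tree are beyond this sketch.

```python
import numpy as np

def merge_gaussians(weights, means, variances):
    """Merge K diagonal-covariance Gaussians into one by moment matching."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                  # normalize mixture weights
    means = np.asarray(means, dtype=float)           # shape (K, D)
    variances = np.asarray(variances, dtype=float)   # shape (K, D)
    merged_mean = (w[:, None] * means).sum(axis=0)
    # second moment of the mixture minus the squared merged mean
    merged_var = (w[:, None] * (variances + means ** 2)).sum(axis=0) - merged_mean ** 2
    return merged_mean, merged_var

def map_adapt_mean(prior_mean, sample_mean, n_frames, tau=10.0):
    """Standard MAP interpolation of a Gaussian mean (tau is the prior weight)."""
    return (tau * prior_mean + n_frames * sample_mean) / (tau + n_frames)

if __name__ == "__main__":
    # Toy example: merge two 2-dimensional Gaussians, then MAP-adapt the result.
    mu, var = merge_gaussians(
        weights=[0.3, 0.7],
        means=[[0.0, 1.0], [2.0, -1.0]],
        variances=[[1.0, 1.0], [0.5, 2.0]],
    )
    adapted_mu = map_adapt_mean(mu, sample_mean=np.array([1.0, 0.0]), n_frames=5)
    print("merged mean:", mu, "merged var:", var, "adapted mean:", adapted_mu)
```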

[1] Vassilios Digalakis et al., "Speaker adaptation using combined transformation and Bayesian methods," IEEE Trans. Speech Audio Process., 1996.

[2] Philip C. Woodland et al., "Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models," Comput. Speech Lang., 1995.

[3] Chin-Hui Lee et al., "On stochastic feature and model compensation approaches to robust speech recognition," Speech Commun., 1998.

[4] Chin-Hui Lee et al., "Unsupervised adaptation using structural Bayes approach," Proc. IEEE ICASSP, 1998.

[5] Georges Linarès et al., "A posteriori and a priori transformations for speaker adaptation in large vocabulary speech recognition systems," INTERSPEECH, 2001.

[6] Georges Linarès et al., "Phoneme Lattice Based A* Search Algorithm for Speech Recognition," TSD, 2002.

[7] Richard M. Stern et al., "Automatic clustering and generation of contextual questions for tied states in hidden Markov models," Proc. IEEE ICASSP, 1999.

[8] Maxine Eskénazi et al., "BREF, a large vocabulary spoken corpus for French," EUROSPEECH, 1991.

[9] Chin-Hui Lee et al., "Structural maximum a posteriori linear regression for fast HMM adaptation," Comput. Speech Lang., 2002.

[10] Chin-Hui Lee et al., "Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains," IEEE Trans. Speech Audio Process., 1994.

[11] Richard M. Schwartz et al., "A compact model for speaker-adaptive training," Proc. ICSLP, 1996.