Smoothed N-best-based speaker adaptation for speech recognition

Smoothed estimation and utterance verification are introduced into the N-best-based speaker adaptation method. This method is effective even for speakers whose decodings using speaker-independent (SI) models are error-prone, that is, for the speakers who most need adaptation. Smoothed estimation improves performance for such speakers, and utterance verification reduces the amount of computation required. In connected-digit (four-digit string) recognition experiments conducted over actual telephone lines, the error rate for speakers whose decodings using SI models are error-prone was reduced by 36.4%. To find an effective model transformation for speaker adaptation, we also discuss replacing mixture-mean bias estimation with the widely used mixture-mean linear-regression-matrix estimation.
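
The two model transformations contrasted in the abstract can be written out as a rough sketch. The notation is assumed here for illustration and is not taken from the paper: \mu_m is the SI mean of Gaussian mixture component m, \hat{\mu}_m is its adapted mean, b is a bias vector, and A is a regression matrix, with the transformation parameters shared across the components being adapted.

```latex
% Mixture-mean bias transformation: every mean is shifted by the same vector b
\hat{\mu}_m = \mu_m + b

% Mixture-mean linear-regression transformation: a matrix A and an offset b
\hat{\mu}_m = A\,\mu_m + b
```

Under this notation the bias form is the special case A = I. For D-dimensional features it has D free parameters per transformation, against D(D+1) for the linear-regression form, which can matter when only a single short utterance is available for adaptation.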
