A Study on Speaker-Adaptive Speech Recognition

A speaker-independent system is desirable in many applications where speaker-specific data do not exist. However, when speaker-dependent data are available, the system can be adapted to the specific speaker so that the error rate is significantly reduced. In this paper, the DARPA Resource Management task is used as the domain to investigate the performance of speaker-adaptive speech recognition. Since adaptation is based on a speaker-independent system with only limited adaptation data, a good adaptation algorithm should be consistent with the speaker-independent parameter estimation criterion and should adapt those parameters that are less sensitive to the limited training data. Two parameter sets, the codebook mean vectors and the output distributions, are regarded as the most important. They are modified within the maximum likelihood estimation framework according to the characteristics of each speaker. To estimate these parameters reliably, output distributions are shared when they exhibit sufficient acoustic similarity. In addition to modifying these parameters, speaker normalization with neural networks is also studied, in the hope that acoustic data normalization will not only adapt the system rapidly but also enhance the robustness of speaker-independent speech recognition. Preliminary results indicate that speaker differences can be minimized effectively. In comparison with speaker-independent recognition, the error rate has been reduced from 4.3% to 3.1% using only parameter adaptation techniques, with 40 adaptation sentences per speaker. When the number of speaker adaptation sentences is comparable to that used for speaker-dependent training, speaker-adaptive recognition outperforms the best speaker-dependent results on the same test set, which indicates the robustness of speaker-adaptive speech recognition.
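
As a rough illustration of the kind of update described above, the sketch below shows one common way codebook mean vectors could be re-estimated from a small amount of speaker-specific data: frames from the adaptation sentences are softly assigned to codewords, per-codeword maximum likelihood means are computed, and the result is interpolated with the speaker-independent means so that rarely observed codewords are not over-fitted. The function name, the smoothing count tau, and the interpolation scheme are assumptions for illustration, not the paper's exact procedure.

    import numpy as np

    def adapt_codebook_means(frames, posteriors, si_means, tau=10.0):
        # frames     : (T, D) feature vectors from a speaker's adaptation sentences
        # posteriors : (T, K) soft assignment of each frame to the K codewords
        # si_means   : (K, D) speaker-independent codebook mean vectors
        # tau        : smoothing count (assumed) controlling how quickly the
        #              update trusts the speaker-specific estimates
        counts = posteriors.sum(axis=0)                       # (K,) expected frame counts
        sums = posteriors.T @ frames                          # (K, D) expected frame sums
        ml_means = sums / np.maximum(counts, 1e-8)[:, None]   # per-speaker ML means
        lam = (counts / (counts + tau))[:, None]              # more data -> trust ML more
        return lam * ml_means + (1.0 - lam) * si_means        # back off to SI means

With only 40 adaptation sentences, the expected counts for many codewords remain small, so the interpolation keeps those means close to their speaker-independent values, consistent with the goal of adapting only parameters that can be estimated reliably from limited data.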
