Model‐based unsupervised instantaneous speaker adaptation

Unsupervised instantaneous adaptation, which uses the input utterance itself for adaptation, is the ideal speaker adaptation method for speech recognition and is expected to be useful in a wide range of applications. Because voice individuality is phoneme dependent, speaker adaptation must be model dependent. However, it is impossible to know the complete model sequence, that is, what was actually spoken, for each input utterance, especially for speakers for whom speaker‐independent models produce many recognition errors. How to perform model‐dependent adaptation without knowing the correct model sequence is therefore a crucial issue. Hypothesizing all possible model sequences and using them for adaptation would require an enormous amount of computation. This paper proposes a new adaptation method in which N‐best hypotheses are created by applying speaker‐independent phone models to each input utterance, and speaker adaptation based on a constrained MAP estimation technique is then applied t...
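
As a rough illustration of the idea of adapting phone models from N‐best hypotheses, the following Python sketch may help. It is not the method of this paper: the abstract does not specify the constraint used in the MAP estimation, so a plain MAP mean update stands in for it, and weighting each hypothesis by a softmax over its log score is an assumption. The names `hypothesis_weights`, `map_adapt_means`, and the prior weight `tau` are hypothetical, and features are one‐dimensional for brevity.

```python
import math
from collections import defaultdict


def hypothesis_weights(log_scores):
    """Turn N-best log scores into normalized weights (assumed: softmax)."""
    top = max(log_scores)
    exps = [math.exp(s - top) for s in log_scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def map_adapt_means(si_means, frames, nbest, tau=10.0):
    """Plain MAP update of phone-model means from N-best alignments.

    si_means: dict mapping phone label -> speaker-independent mean (float)
    frames:   list of scalar features, one per frame
    nbest:    list of (log_score, alignment) pairs, where alignment is a
              list of phone labels, one per frame (hypothetical format)
    tau:      prior weight; larger values keep means closer to si_means
    """
    weights = hypothesis_weights([score for score, _ in nbest])
    occupancy = defaultdict(float)  # weighted frame count per phone
    weighted_sum = defaultdict(float)  # weighted feature sum per phone
    for weight, (_, alignment) in zip(weights, nbest):
        for x, phone in zip(frames, alignment):
            occupancy[phone] += weight
            weighted_sum[phone] += weight * x
    # Phones never observed keep their speaker-independent mean.
    return {
        phone: (tau * mean + weighted_sum[phone]) / (tau + occupancy[phone])
        for phone, mean in si_means.items()
    }


# Toy usage: two competing hypotheses for a two-frame utterance.
si_means = {"a": 0.0, "i": 1.0}
frames = [2.0, 2.0]
nbest = [(-1.0, ["a", "a"]), (-1.5, ["a", "i"])]
adapted = map_adapt_means(si_means, frames, nbest, tau=2.0)
```

Because every hypothesis contributes in proportion to its weight, a phone that appears only in low‐scoring hypotheses moves little from its speaker‐independent mean, which is one simple way to limit the damage from recognition errors in the hypotheses.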