An Acoustic Modeling Method Robust against Changes of Speaking Style in Error Recovery

Synopsis The performance of automatic speech recognition systems is not sufficient to recognize speech perfectly, so users must rephrase their utterances when recognition errors occur. However, the speaking style of error-recovery utterances differs significantly from that of normal utterances, and because of this change, recognition performance degrades even further. In Japanese in particular, syllable-stressed utterances occur more frequently during error recovery. In this paper, we propose an acoustic modeling method that is robust to syllable-stressed speech. In a syllable-stressed utterance, each syllable is produced in a manner between a continuous utterance and an isolated syllable utterance, and the continuity between syllables changes. To handle the acoustic characteristics of syllables uttered like isolated syllables, we use vowel triphone models, taken from the existing acoustic models, whose right context is silence. To handle the change in the continuity of each syllable, we use left-context-dependent vowel biphone models generated from the same training data used to create the existing acoustic models. These models and the conventional triphone acoustic models are combined into a single acoustic model by taking a multi-path approach. Our method improves the performance of speech recognition systems on syllable-stressed speech without requiring any additional training data, and it also yields slightly better performance on normal utterances.
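The multi-path composition can be pictured as expanding each vowel position in the pronunciation network into parallel arcs, so the decoder can choose per syllable between the conventional triphone, a triphone whose right context is silence (for syllables uttered like isolated syllables), and a left-context-dependent vowel biphone. The following Python sketch illustrates this idea only; the vowel set, phone notation, and model-naming scheme are our own illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a multi-path expansion (assumed notation, not the
# authors' code): each vowel position in a phone sequence is given three
# parallel arcs, one per acoustic-model variant described in the synopsis.

from dataclasses import dataclass

VOWELS = {"a", "i", "u", "e", "o"}  # Japanese vowels (illustrative)

@dataclass(frozen=True)
class Arc:
    model: str  # name of the HMM placed on this arc

def triphone(left: str, phone: str, right: str) -> str:
    return f"{left}-{phone}+{right}"

def biphone(left: str, phone: str) -> str:
    return f"{left}-{phone}"

def expand(phones: list[str]) -> list[list[Arc]]:
    """Return, for each phone position, its set of parallel arcs."""
    paths = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else "sil"
        right = phones[i + 1] if i + 1 < len(phones) else "sil"
        arcs = [Arc(triphone(left, p, right))]          # conventional triphone path
        if p in VOWELS:
            arcs.append(Arc(triphone(left, p, "sil")))  # vowel triphone with silence right context
            arcs.append(Arc(biphone(left, p)))          # left-context vowel biphone
        paths.append(arcs)
    return paths

if __name__ == "__main__":
    # "kaku" -> k a k u; each vowel position expands to three parallel arcs.
    for position in expand(["k", "a", "k", "u"]):
        print([a.model for a in position])
```

In an actual decoder, these parallel arcs would share entry and exit states so that the search selects the best-matching variant for each syllable at decode time, which is what lets a single composed model cover both normal and syllable-stressed speaking styles.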
