Paper information - A combination of speaker normalization and speech rate normalization for automatic speech recognition

This contribution introduces a normalization procedure for automatic speech recognition that aims to reduce speaking-rate-specific variations in the features of the phonetic classes. A "spurtwise" calculation of normalization factors makes it possible to capture changes of the speaking rate within one utterance. The cost-saving implementation, which uses linear interpolation of the original features and a word graph rescoring procedure, leads to only a moderate increase in computational load compared to the baseline system without speech rate normalization. In addition, a two-step procedure that combines vocal tract length normalization (VTLN) and speech rate normalization (SRN) has been developed. Experiments showed that applying SRN to a VTLN-based recognition system leads to a relative reduction in word error rate of 4.2%. This is comparable to the decrease observed when using SRN on a system without VTLN. Overall, the combination of VTLN and SRN results in a 15% reduction in word error rate compared to the baseline system.

Thilo Pfau | Robert Faltlhauser | Günther Ruske
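The abstract mentions that speech rate normalization is implemented by linear interpolation of the original features, with one normalization factor per spurt. The following is a minimal sketch of that idea, not the authors' implementation: the function names `warp_features` and `normalize_spurts`, the convention that a factor above 1 stretches the sequence, and the way the output length is chosen are all assumptions made for illustration. How the factors themselves are estimated is not described in the abstract and is left out.

```python
import math


def warp_features(frames, factor):
    """Resample a sequence of feature vectors along time by linear interpolation.

    Assumed convention (hypothetical): factor > 1 stretches the sequence
    (more frames, as one might do for fast speech), factor < 1 compresses it.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n = len(frames)
    if n == 0:
        return []
    # Output length chosen so that first and last frames are preserved.
    out_len = max(1, round((n - 1) * factor) + 1)
    warped = []
    for j in range(out_len):
        # Position of output frame j on the original time axis, clamped
        # so that rounding of out_len never reads past the last frame.
        t = min(j / factor, n - 1)
        lo = math.floor(t)
        hi = min(lo + 1, n - 1)
        w = t - lo
        # Blend the two neighbouring original frames component by component.
        warped.append([(1 - w) * a + w * b
                       for a, b in zip(frames[lo], frames[hi])])
    return warped


def normalize_spurts(spurts, factors):
    """Apply one (assumed, precomputed) normalization factor per spurt."""
    return [warp_features(s, f) for s, f in zip(spurts, factors)]
```

For example, `warp_features([[0.0], [2.0], [4.0]], 2.0)` doubles the time resolution of a three-frame, one-dimensional sequence by inserting interpolated frames between the originals, while a factor of 1.0 returns the frames unchanged.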
