Duration modeling in large vocabulary speech recognition

This paper presents a study of different methods for phoneme duration modeling in large vocabulary speech recognition. We investigate the use of phoneme duration and the effects of context, speaking rate, and lexical stress on the duration of phoneme segments in a large vocabulary speech recognition system. The duration models are used in a postprocessing phase of BYBLOS, our baseline HMM-based recognition system, to rescore the N-best hypotheses. We describe experiments with the 5K-word ARPA Wall Street Journal (WSJ) corpus. The results show that integrating duration models that take context and speaking rate into account can improve the word accuracy of the baseline recognition system.
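
The sketch below illustrates the general idea of N-best rescoring with a duration score. The data structures, the gamma duration distributions, the speaking-rate normalization, and the combination weight are all assumptions for illustration; the paper does not specify these details here.

```python
# Minimal sketch of N-best rescoring with phoneme duration scores.
# Hypothetical structures and parameters; the exact duration model forms
# and score combination used in the paper may differ.

import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Segment:
    phone: str          # phoneme label (possibly context-dependent)
    duration: float     # segment duration in frames


@dataclass
class Hypothesis:
    words: List[str]
    segments: List[Segment]
    hmm_score: float    # combined acoustic + language-model log score


def gamma_log_pdf(x: float, shape: float, scale: float) -> float:
    """Log density of a gamma distribution, a common choice for durations."""
    return ((shape - 1.0) * math.log(x)
            - x / scale
            - math.lgamma(shape)
            - shape * math.log(scale))


def duration_score(hyp: Hypothesis,
                   models: Dict[str, Tuple[float, float]],
                   rate: float) -> float:
    """Sum of per-segment duration log-likelihoods, with durations
    normalized by an utterance-level speaking-rate factor (assumed > 0)."""
    score = 0.0
    for seg in hyp.segments:
        shape, scale = models.get(seg.phone, (2.0, 5.0))  # back-off parameters
        score += gamma_log_pdf(seg.duration / rate, shape, scale)
    return score


def rescore(nbest: List[Hypothesis],
            models: Dict[str, Tuple[float, float]],
            rate: float,
            weight: float = 0.1) -> List[Hypothesis]:
    """Re-rank the N-best list by HMM score plus weighted duration score."""
    return sorted(
        nbest,
        key=lambda h: h.hmm_score + weight * duration_score(h, models, rate),
        reverse=True)
```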
