MODELING WORD DURATION FOR BETTER SPEECH RECOGNITION

We describe a new method of modeling duration at the word level. These duration models are easily trained from the acoustic training data and can be used to rescore N-best lists of recognition hypotheses. The models capture some well-known durational effects, such as prepausal lengthening, and incorporate a simple backoff mechanism to handle words encountered during rescoring that were not seen in training. Experiments with various large-vocabulary conversational speech recognition (LVCSR) evaluation sets showed consistent improvements of 0.7–1.0% in word error rate (WER).
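The rescoring idea can be illustrated with a minimal sketch. The abstract does not specify the form of the duration model, so everything below is an assumption made for illustration: a single Gaussian over each word's total duration, a global Gaussian as the backoff for unseen or rare words, and a log-linear combination with the baseline score. The names `train_duration_models`, `duration_score`, `rescore_nbest`, `min_count`, and `weight` are hypothetical and do not come from the paper.

```python
import math
from collections import defaultdict


def fit_gaussian(durations, var_floor=1e-4):
    """Return (mean, variance) of a list of durations, with a variance floor."""
    mean = sum(durations) / len(durations)
    var = sum((d - mean) ** 2 for d in durations) / len(durations)
    return mean, max(var, var_floor)


def log_gauss(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def train_duration_models(aligned_words, min_count=2):
    """aligned_words: iterable of (word, duration_in_seconds) pairs,
    e.g. taken from forced alignments of the acoustic training data.
    Returns (per-word models, global backoff model)."""
    by_word = defaultdict(list)
    for word, dur in aligned_words:
        by_word[word].append(dur)
    all_durs = [d for durs in by_word.values() for d in durs]
    models = {w: fit_gaussian(durs)
              for w, durs in by_word.items() if len(durs) >= min_count}
    return models, fit_gaussian(all_durs)


def duration_score(hyp_words, models, backoff):
    """Sum of duration log-likelihoods over one hypothesis.
    Words without their own model fall back to the global model."""
    return sum(log_gauss(dur, *models.get(word, backoff))
               for word, dur in hyp_words)


def rescore_nbest(nbest, models, backoff, weight=1.0):
    """nbest: list of (baseline_log_score, [(word, duration), ...]).
    Returns the index of the best hypothesis after adding the
    weighted duration score to the baseline score."""
    totals = [base + weight * duration_score(words, models, backoff)
              for base, words in nbest]
    return max(range(len(nbest)), key=totals.__getitem__)


# Toy usage: the second hypothesis has a better baseline score but
# implausible word durations, so rescoring prefers the first.
train = [("the", 0.10), ("the", 0.12), ("the", 0.08),
         ("recognition", 0.60), ("recognition", 0.70), ("recognition", 0.65)]
models, backoff = train_duration_models(train)
nbest = [(-10.0, [("the", 0.10), ("recognition", 0.65)]),
         (-9.5, [("the", 0.65), ("recognition", 0.10)])]
best = rescore_nbest(nbest, models, backoff)
```

In this sketch a context-dependent effect such as prepausal lengthening could be approximated by keying the models on a (word, followed-by-pause) pair and backing off to the word-only model, though that too is an assumption rather than the method described here.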
