Applying word duration constraints by using unrolled HMMs

Conventional HMMs have weak duration constraints. In noisy conditions, the mismatch between corrupted speech signals and models trained on clean speech may cause the decoder to produce word matches with unrealistic durations. This paper presents a simple way to incorporate word duration constraints by unrolling HMMs to form a lattice where word duration probabilities can be applied directly to state transitions. The expanded HMMs are compatible with conventional Viterbi decoding. Experiments on connected-digit recognition show that when using explicit duration constraints the decoder generates word matches with more reasonable durations, and word error rates are significantly reduced across a broad range of noise conditions.

[1]  Kevin Power Durational modelling for improved connected digit recognition , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[2]  T. Crystal,et al.  Segmental durations in connected‐speech signals: Current results , 1988 .

[3]  R. Moore,et al.  Explicit modelling of state occupancy in hidden Markov models for automatic speech recognition , 1985, ICASSP '85. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[4]  Frank K. Soong,et al.  High performance connected digit recognition using hidden Markov models , 1989, IEEE Trans. Acoust. Speech Signal Process..

[5]  Jon Barker,et al.  Robust ASR based on clean speech models: an evaluation of missing data techniques for connected digit recognition in noise , 2001, INTERSPEECH.

[6]  J C Junqua,et al.  The Lombard reflex and its role on human listeners and automatic speech recognizers. , 1993, The Journal of the Acoustical Society of America.

[7]  CookeMartin,et al.  Robust automatic speech recognition with missing and unreliable acoustic data , 2001 .

[8]  David Burshtein Robust parametric modeling of durations in hidden Markov models , 1996, IEEE Trans. Speech Audio Process..

[9]  Steve Renals,et al.  Efficient evaluation of the LVCSR search space using the NOWAY decoder , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[10]  T. Crystal,et al.  Segmental durations in connected-speech signals: Syllabic stress , 1988 .

[11]  Mary P. Harper,et al.  On the complexity of explicit duration HMM's , 1995, IEEE Trans. Speech Audio Process..

[12]  Ning Ma,et al.  Context-dependent word duration modelling for robust speech recognition , 2005, INTERSPEECH.