Hidden Semi-Markov Model Based Speech Recognition System using Weighted Finite-State Transducer

In hidden Markov models (HMMs), state duration probabilities decrease exponentially with time, which is an inappropriate representation of the temporal structure of speech. One solution to this problem is to integrate state duration probability distributions explicitly into the HMM; the resulting model is known as a hidden semi-Markov model (HSMM). Although a number of attempts to use explicit duration models in speech recognition systems have been proposed, they are not consistent, because various approximations were used in both training and decoding. In the present paper, a fully consistent speech recognition system based on the HSMM framework is proposed. In a speaker-dependent continuous speech recognition experiment, the HSMM-based speech recognition system achieved about 5.9% relative error reduction over the corresponding HMM-based one.
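
The contrast between implicit and explicit duration modeling can be illustrated with a minimal sketch. It assumes a single state with self-transition probability a, so the implied HMM duration distribution is p(d) = a^(d-1) (1 - a), and it uses a discretized Gaussian as the explicit HSMM duration distribution. The Gaussian form, the parameter values, and the function names are illustrative assumptions, not details taken from the paper.

```python
import math


def hmm_duration_pmf(self_loop_prob, max_duration):
    """Implicit HMM state duration: p(d) = a^(d-1) * (1 - a), d = 1..max_duration."""
    a = self_loop_prob
    return [a ** (d - 1) * (1.0 - a) for d in range(1, max_duration + 1)]


def hsmm_duration_pmf(mean, std, max_duration):
    """Explicit duration model (assumed form): Gaussian sampled at integer
    durations 1..max_duration and renormalized to sum to one."""
    weights = [math.exp(-0.5 * ((d - mean) / std) ** 2)
               for d in range(1, max_duration + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical parameter values, chosen only for illustration.
hmm = hmm_duration_pmf(self_loop_prob=0.8, max_duration=20)
hsmm = hsmm_duration_pmf(mean=5.0, std=1.5, max_duration=20)

# Most probable duration under each model (durations start at 1).
hmm_mode = 1 + hmm.index(max(hmm))
hsmm_mode = 1 + hsmm.index(max(hsmm))
print("most probable HMM duration:", hmm_mode)
print("most probable HSMM duration:", hsmm_mode)
```

Under the implicit model the shortest duration is always the most probable, whatever the value of a, whereas an explicit distribution can place its peak at a duration typical of the modeled speech unit.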
