A Fully Consistent Hidden Semi-Markov Model-Based Speech Recognition System

In a hidden Markov model (HMM), state duration probabilities decrease exponentially with time, which fails to adequately represent the temporal structure of speech. One of the solutions to this problem is integrating state duration probability distributions explicitly into the HMM. This form is known as a hidden semi-Markov model (HSMM). However, though a number of attempts to use HSMMs in speech recognition systems have been proposed, they are not consistent because various approximations were used in both training and decoding. By avoiding these approximations using a generalized forward-backward algorithm, a context-dependent duration modeling technique and weighted finite-state transducers (WFSTs), we construct a fully consistent HSMM-based speech recognition system. In a speaker-dependent continuous speech recognition experiment, our system achieved about 9.1% relative error reduction over the corresponding HMM-based system.

[1]  Cyril Allauzen,et al.  Generalized optimization algorithm for speech recognition transducers , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[2]  Heiga Zen,et al.  A Hidden Semi-Markov Model-Based Speech Synthesis System , 2007, IEICE Trans. Inf. Syst..

[3]  A. Cook,et al.  Experimental evaluation of duration modelling techniques for automatic speech recognition , 1987, ICASSP '87. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[4]  Myoung-Wan Koo,et al.  Context-Dependent Phoneme Duration Modeling with Tree-Based State Tying , 2005, IEICE Trans. Inf. Syst..

[5]  Mari Ostendorf,et al.  From HMM's to segment models: a unified view of stochastic modeling for speech recognition , 1996, IEEE Trans. Speech Audio Process..

[6]  Fernando Pereira,et al.  Weighted finite-state transducers in speech recognition , 2002, Comput. Speech Lang..

[7]  H. Kobayashi,et al.  An efficient forward-backward algorithm for an explicit-duration hidden Markov model , 2003, IEEE Signal Processing Letters.

[8]  Keiichi Tokuda,et al.  Investigation of State Duration Model based on Gamma distribution for HMM-based Speech Synthesis , 2001 .

[9]  Venkata Ramana Rao Gadde Modeling word durations , 2000, INTERSPEECH.

[10]  Jj Odell,et al.  The Use of Context in Large Vocabulary Speech Recognition , 1995 .

[11]  Keiichi Tokuda,et al.  Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis , 1999, EUROSPEECH.

[12]  B. Juang,et al.  Context-dependent Phonetic Hidden Markov Models for Speaker-independent Continuous Speech Recognition , 2008 .

[13]  Stephen E. Levinson,et al.  Continuously variable duration hidden Markov models for automatic speech recognition , 1986 .

[14]  Kai-Fu Lee,et al.  Context-independent phonetic hidden Markov models for speaker-independent continuous speech recognition , 1990 .