Stochastic trajectory model with state-mixture for continuous speech recognition

The problem of acoustic modeling for continuous speech recognition is addressed. To deal with coarticulation effects and interspeaker variability, an extension of the mixture stochastic trajectory model (MSTM) is proposed. MSTM is a segment-based model that uses phonemes as speech units. In MSTM, the observations of a phoneme are modeled by a set of stochastic trajectories. The trajectories are modeled by a mixture of probability density functions (pdfs) of state sequences, and each state is associated with a multivariate Gaussian density function. We propose replacing the single Gaussian pdf of each state with a mixture of Gaussian pdfs, giving MSTM with state-mixture (SM-MSTM). The model parameters are estimated under the maximum-likelihood (ML) criterion using the expectation-maximization (EM) algorithm. On a speaker-dependent continuous speech recognition task, SM-MSTM reduces the word error rate by about 15% compared with the baseline MSTM, even with an equal number of parameters. Experiments on a multispeaker continuous speech recognition task show no significant improvement over the baseline system.
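The segment likelihood described above can be written as p(X) = sum_k w_k * prod_n sum_m c_knm * N(x_n; mu_knm, Sigma_knm), where k indexes trajectories (state sequences), n indexes the states along a trajectory and m indexes the Gaussian components of a state; with one component per state this reduces to the baseline MSTM. The following is a minimal Python sketch of that likelihood and of the EM E-step posteriors, not the authors' implementation. It assumes details the abstract does not state: diagonal covariances, a fixed number of states per trajectory, and time normalization by picking evenly spaced frames. All names (`MixtureState`, `resample`, `e_step`, `segment_log_likelihood`) are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def log_gauss_diag(x: Vector, mean: Vector, var: Vector) -> float:
    """Log density of a Gaussian with diagonal covariance (an assumption)."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(values)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


class MixtureState:
    """SM-MSTM state pdf: a weighted mixture of Gaussians.

    With a single component this reduces to a baseline MSTM state.
    """

    def __init__(self, weights: List[float], means: List[Vector],
                 variances: List[Vector]):
        self.weights, self.means, self.variances = weights, means, variances

    def component_log_terms(self, x: Vector) -> List[float]:
        # log(c_m) + log N(x; mu_m, Sigma_m) for each component m
        return [math.log(c) + log_gauss_diag(x, mu, var)
                for c, mu, var in zip(self.weights, self.means, self.variances)]

    def log_pdf(self, x: Vector) -> float:
        return logsumexp(self.component_log_terms(x))


def resample(segment: List[Vector], num_points: int) -> List[Vector]:
    """Hypothetical time normalization: keep num_points evenly spaced frames."""
    step = (len(segment) - 1) / max(num_points - 1, 1)
    return [segment[round(i * step)] for i in range(num_points)]


def e_step(segment: List[Vector], traj_weights: List[float],
           trajectories: List[List[MixtureState]]
           ) -> Tuple[float, List[List[List[float]]]]:
    """Return (log p(X), gamma) with gamma[k][n][m] = P(trajectory k, component m at state n | X)."""
    points = resample(segment, len(trajectories[0]))
    per_traj, comp_post = [], []
    for w, states in zip(traj_weights, trajectories):
        total = math.log(w)
        post_k = []
        for state, x in zip(states, points):
            terms = state.component_log_terms(x)
            lp = logsumexp(terms)
            total += lp
            post_k.append([math.exp(t - lp) for t in terms])  # P(m | k, x_n)
        per_traj.append(total)
        comp_post.append(post_k)
    log_px = logsumexp(per_traj)
    # P(k, m_n | X) = P(k | X) * P(m_n | k, x_n)
    gamma = [[[math.exp(lk - log_px) * p for p in post_n] for post_n in post_k]
             for lk, post_k in zip(per_traj, comp_post)]
    return log_px, gamma


def segment_log_likelihood(segment: List[Vector], traj_weights: List[float],
                           trajectories: List[List[MixtureState]]) -> float:
    """log p(X) = log sum_k w_k prod_n sum_m c_knm N(x_n; mu_knm, Sigma_knm)."""
    return e_step(segment, traj_weights, trajectories)[0]


if __name__ == "__main__":
    # Toy 1-D phoneme model: two trajectories of three states, two components per state.
    trajs = [
        [MixtureState([0.5, 0.5], [[m - 1.0], [m + 1.0]], [[1.0], [1.0]]) for m in (0.0, 1.0, 2.0)],
        [MixtureState([0.3, 0.7], [[m], [m + 0.5]], [[0.5], [2.0]]) for m in (2.0, 1.0, 0.0)],
    ]
    seg = [[0.1], [0.5], [1.2], [2.0], [2.4]]
    log_px, gamma = e_step(seg, [0.6, 0.4], trajs)
    print("log p(X) =", log_px)
```

In an EM iteration, the `gamma` values would serve as soft counts for re-estimating the trajectory weights, component weights, means and variances. Working in the log domain keeps the product over states from underflowing on long segments.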
