Modeling long term variability information in a mixture stochastic trajectory framework

This paper addresses acoustic modeling for speech recognizers. We distinguish two types of speech variability: long-term (speaker identity, stationary noise, channel distortion) and short-term (phoneme class). Most current recognizers model both types of variability without accounting for their specific properties, which may result in flat distributions with limited discriminability. In our system, the long-term variability (the environment) is modeled by a mixture model in which each mixture component is itself a mixture stochastic trajectory model (MSTM). We propose the environment-dependent mixture stochastic trajectory model (ED-MSTM) to model a set of environments. The parameters of the ED-MSTM are estimated under the maximum likelihood (ML) criterion using the expectation-maximisation (EM) algorithm. The model has been tested on a 1011-word vocabulary, multi-speaker, continuous French speech recognition task with noisy speech. In the experiments, we assume that speakers can be grouped into a pre-determined number of classes and that the class label of each speaker is missing. Environment modeling reduced the error rate of the multi-speaker system by about 15%, a statistically significant improvement. The idea of environment modeling is applicable to other acoustic modeling techniques, such as hidden Markov models.
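The estimation setting described above (a fixed number of speaker classes, with each speaker's class label missing) can be illustrated with a small EM sketch. This is not the ED-MSTM itself: as a loudly stated assumption, each environment is represented here by a single diagonal Gaussian standing in for a full MSTM, and all names (`em_environment_mixture`, `log_gauss`, the farthest-point initialisation) are hypothetical choices made for illustration only. The point it shows is that the hidden label is shared by all frames of a speaker, so the E-step computes one posterior per speaker rather than per frame.

```python
import numpy as np

def log_gauss(x, mean, var):
    """Per-frame log density of a diagonal Gaussian; x has shape (n, d)."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var, axis=1)

def em_environment_mixture(speakers, n_env, n_iter=20):
    """Toy EM for a mixture over environments with hidden speaker-level labels.

    speakers: list of (n_i, d) arrays, one array of frames per speaker.
    Each environment is a diagonal Gaussian (an assumed stand-in for an MSTM).
    """
    all_frames = np.vstack(speakers)
    spk_means = np.array([s.mean(axis=0) for s in speakers])

    # Deterministic farthest-point initialisation over speaker means
    # (an illustrative choice, not taken from the paper).
    chosen = [0]
    while len(chosen) < n_env:
        diffs = spk_means[:, None, :] - spk_means[chosen][None, :, :]
        dists = np.linalg.norm(diffs, axis=2).min(axis=1)
        chosen.append(int(np.argmax(dists)))
    means = spk_means[chosen].copy()
    variances = np.tile(all_frames.var(axis=0), (n_env, 1))
    weights = np.full(n_env, 1.0 / n_env)

    for _ in range(n_iter):
        # E-step: posterior of each environment given ALL frames of a speaker.
        log_w = np.log(np.clip(weights, 1e-12, None))
        log_post = np.array([
            [log_w[e] + log_gauss(s, means[e], variances[e]).sum()
             for e in range(n_env)]
            for s in speakers
        ])
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)

        # M-step: every frame is weighted by its speaker's posterior.
        for e in range(n_env):
            w = np.concatenate(
                [np.full(len(s), post[i, e]) for i, s in enumerate(speakers)]
            )
            total = w.sum() + 1e-12
            means[e] = (w[:, None] * all_frames).sum(axis=0) / total
            variances[e] = (
                (w[:, None] * (all_frames - means[e]) ** 2).sum(axis=0) / total
                + 1e-6
            )
        weights = post.mean(axis=0)

    return weights, means, variances, post
```

In a recognizer following the abstract's design, the Gaussian log-likelihood would be replaced by the likelihood of the speaker's data under the environment-specific MSTM; the structure of the E-step and M-step over the hidden environment label would stay the same under that assumption.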
