Target-directed mixture dynamic models for spontaneous speech recognition

In this paper, a novel mixture linear dynamic model (MLDM) for speech recognition is developed and evaluated, where several linear dynamic models are combined (mixed) to represent different vocal-tract-resonance (VTR) dynamic behaviors and the mapping relationships between the VTRs and the acoustic observations. Each linear dynamic model is formulated as the state-space equations, where the VTRs target-directed property is incorporated in the state equation and a linear regression function is used for the observation equation that approximates the nonlinear mapping relationship. A version of the generalized EM algorithm is developed for learning the model parameters, where the constraint that the VTR targets change at the segmental level (rather than at the frame level) is imposed in the parameter learning and model scoring algorithms. Speech recognition experiments are carried out to evaluate the new model using the N-best re-scoring paradigm in a Switchboard task. Compared with a baseline recognizer using the triphone HMM acoustic model, the new recognizer demonstrated improved performance under several experimental conditions. The performance was shown to increase with an increased number of the mixture components in the model.

[1]  John S. Bridle,et al.  The HDM: a segmental hidden dynamic model of coarticulation , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[2]  Li Deng,et al.  Optimization of dynamic regimes in a statistical hidden dynamic model for conversational speech recognition , 1999, EUROSPEECH.

[3]  Li Deng,et al.  A stochastic model of speech incorporating hierarchical nonstationarity , 1993, IEEE Trans. Speech Audio Process..

[4]  Li Deng,et al.  Speaker-independent phonetic classification using hidden Markov models with state-conditioned mixtures of trend functions , 1997 .

[5]  Heekuck Oh,et al.  Neural Networks for Pattern Recognition , 1993, Adv. Comput..

[6]  Yifan Gong,et al.  Stochastic trajectory modeling for speech recognition , 1994, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing.

[7]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[8]  R. Shumway,et al.  AN APPROACH TO TIME SERIES SMOOTHING AND FORECASTING USING THE EM ALGORITHM , 1982 .

[9]  R. Shumway,et al.  Dynamic linear models with switching , 1991 .

[10]  Li Deng,et al.  A dynamic, feature-based approach to the interface between phonology and phonetics for speech modeling and recognition , 1998, Speech Commun..

[11]  Mari Ostendorf,et al.  From HMM's to segment models: a unified view of stochastic modeling for speech recognition , 1996, IEEE Trans. Speech Audio Process..

[12]  J. S. Bridle,et al.  An investigation of segmental hidden dynamic models of speech coarticulation for automatic speech recognition , 1998 .

[13]  Li Deng,et al.  Speaker-independent phonetic classification using hidden Markov models with mixtures of trend functions , 1997, IEEE Trans. Speech Audio Process..

[14]  R. Streit,et al.  Probabilistic Multi-Hypothesis Tracking , 1995 .

[15]  L Deng,et al.  Spontaneous speech recognition using a statistical coarticulatory model for the vocal-tract-resonance dynamics. , 2000, The Journal of the Acoustical Society of America.

[16]  Mari Ostendorf,et al.  ML estimation of a stochastic linear system with the EM algorithm and its application to speech recognition , 1993, IEEE Trans. Speech Audio Process..

[17]  Y. Bar-Shalom Tracking and data association , 1988 .

[18]  O. Kimball,et al.  Segment modeling alternatives for continuous speech recognition , 1995 .

[19]  Herbert Gish,et al.  A segmental speech model with applications to word spotting , 1993, 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[20]  J. Mendel Lessons in Estimation Theory for Signal Processing, Communications, and Control , 1995 .

[21]  Li Deng,et al.  Computational Models for Speech Production , 2018, Speech Processing.

[22]  Li Deng,et al.  A generalized hidden Markov model with state-conditioned trend functions of time for the speech signal , 1992, Signal Process..

[23]  Steve J. Young,et al.  Towards improved speech recognition using a speech production model , 1995, EUROSPEECH.

[24]  Li Deng,et al.  Initial evaluation of hidden dynamic models on conversational speech , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).