Computational Models for Speech Production

Major speech production models from the speech science literature and a number of popular statistical “generative” models of speech used in speech technology are surveyed. The strengths and weaknesses of these two styles of speech model are analyzed, pointing to the need to combine their respective strengths while eliminating their respective weaknesses. As an example, a statistical task-dynamic model of speech production is described; it is motivated by the original deterministic version of the model and targeted at integrated-multilingual speech recognition applications. Methods for model parameter learning (training) and for likelihood computation (recognition) are described, based on statistical optimization principles that draw on neural network and dynamic system theories.
