Structured speech modeling

Modeling the dynamic structure of speech is a novel paradigm in speech recognition research within the generative modeling framework, and it offers the potential to overcome limitations of the current hidden Markov modeling approach. Analogous to structured language models, where syntactic structure is exploited to represent long-distance relationships among words, the structured speech model described in this paper uses the dynamic structure in the hidden vocal tract resonance space to characterize long-span contextual influence among phonetic units. We first give a general overview of the hierarchically classified types of dynamic speech models in the literature. We then give a detailed account of a specific model type, the hidden trajectory model, and describe the steps of model construction and the parameter estimation algorithms. We show how the use of resonance target parameters and their temporal filtering enables joint modeling of long-span coarticulation and phonetic reduction effects. Phonetic recognition experiments demonstrate superior recognizer performance over a modern hidden Markov model-based system. Error analysis shows that the greatest performance gain occurs within the sonorant speech class.
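
As a rough illustration of how temporally filtering resonance targets can yield both coarticulation and reduction, the sketch below smooths a piecewise-constant target sequence with a symmetric, normalized filter. The filter shape (exponential decay in both directions), the parameters `gamma` and `span`, the function names, and the target values in Hz are all assumptions made for illustration; they are not taken from the paper, whose actual filter and estimation procedure may differ.

```python
def make_targets(segments):
    """Expand (target_hz, duration_in_frames) pairs into a per-frame sequence."""
    return [hz for hz, frames in segments for _ in range(frames)]


def filter_targets(targets, gamma=0.6, span=7):
    """Smooth a per-frame target sequence with a symmetric, normalized FIR filter.

    targets: list of floats, one hypothetical resonance target (Hz) per frame.
    gamma:   assumed decay factor in (0, 1); larger means stronger smoothing.
    span:    assumed number of frames the filter reaches on each side.
    """
    n = len(targets)
    trajectory = []
    for k in range(n):
        # Window is truncated at the utterance edges and renormalized.
        lo, hi = max(0, k - span), min(n - 1, k + span)
        frames = range(lo, hi + 1)
        weights = [gamma ** abs(k - j) for j in frames]
        total = sum(weights)
        value = sum(w * targets[j] for w, j in zip(weights, frames)) / total
        trajectory.append(value)
    return trajectory


# Hypothetical second-resonance targets for a consonant-vowel-consonant pattern.
# Only the vowel duration differs between the two utterances.
slow = filter_targets(make_targets([(1200.0, 10), (2200.0, 20), (1200.0, 10)]))
fast = filter_targets(make_targets([(1200.0, 10), (2200.0, 4), (1200.0, 10)]))

print("peak with long vowel: ", max(slow))
print("peak with short vowel:", max(fast))
```

In this toy setting the long vowel reaches its target, while the short vowel's trajectory peaks well below it. The same filter therefore produces smooth transitions into neighboring segments (coarticulation) and target undershoot at short durations (reduction), with no separate mechanism for either effect.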
