Use of Differential Cepstra as Acoustic Features in Hidden Trajectory Modeling for Phonetic Recognition

We generalize the earlier hidden trajectory model (HTM) of speech dynamics, which predicts only the "static" cepstra as the observed acoustic feature, to one that jointly predicts static cepstra and their temporal differentials (i.e., delta cepstra). The generalized HTM is formulated in the generative-modeling framework, which enables efficient computation of the joint likelihood of the static and delta cepstral sequences given the model. We develop parameter estimation techniques for the new model; after a vector Taylor series approximation, these yield closed-form estimation formulas. The generalization from the earlier static-cepstra HTM to the new static/delta-cepstra HTM is principled not only in the model formulation but also in the analytical form of the (monophone) parameter estimates. Experimental results on the standard TIMIT phonetic recognition task show that the new model improves recognition accuracy over the earlier best HTM system, and that both HTM systems are significantly more accurate than state-of-the-art triphone HMM systems.
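To make the notion of "joint static and delta cepstra" concrete, the following is a minimal sketch of how such a feature sequence could be assembled from static cepstral frames. The abstract does not state the exact differencing formula, so the symmetric difference c[t+span] - c[t-span] with clamped edge frames is an assumption, and the names `delta_cepstra`, `joint_features`, and `span` are hypothetical.

```python
from typing import List

Frame = List[float]


def delta_cepstra(static: List[Frame], span: int = 1) -> List[Frame]:
    """Temporal differentials of static cepstra.

    Assumption: a symmetric difference c[t+span] - c[t-span], with
    out-of-range frame indices clamped to the first/last frame.
    """
    n = len(static)
    deltas = []
    for t in range(n):
        ahead = static[min(t + span, n - 1)]
        behind = static[max(t - span, 0)]
        deltas.append([a - b for a, b in zip(ahead, behind)])
    return deltas


def joint_features(static: List[Frame], span: int = 1) -> List[Frame]:
    """Concatenate each static frame with its delta frame."""
    deltas = delta_cepstra(static, span)
    return [s + d for s, d in zip(static, deltas)]


if __name__ == "__main__":
    # Toy 2-dimensional "cepstra" over 4 frames (illustrative values only).
    cepstra = [[1.0, 0.0], [2.0, 0.5], [4.0, 1.5], [7.0, 3.0]]
    for frame in joint_features(cepstra):
        print(frame)
```

Because each delta frame is a fixed linear combination of neighboring static frames, a model that predicts the static trajectory also implies a prediction for the deltas, which is what allows the two streams to be scored jointly.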