Adaptive Kalman Filtering and Smoothing for Tracking Vocal Tract Resonances Using a Continuous-Valued Hidden Dynamic Model

A novel Kalman filtering/smoothing algorithm is presented for efficient and accurate estimation of vocal tract resonances, or formants, in fluent speech; these are the natural frequencies and bandwidths of the resonator from the larynx to the lips. The algorithm uses a hidden dynamic model in state-space form, in which the resonance frequencies and bandwidths are treated as continuous-valued hidden state variables. The observation equation of the model is built from an analytical predictive function that maps the resonance frequencies and bandwidths to LPC cepstra, which serve as the observation vectors. This nonlinear function is adaptively linearized, and an adaptively trained residual (bias) term is added to it to represent the piecewise linear approximation error, which is reduced iteratively. Details of the piecewise linearization design process are described. An iterative tracking algorithm is presented that embeds both the adaptive residual training and the piecewise linearization design in the Kalman filtering/smoothing framework. Experiments on estimating resonances in Switchboard speech data show accurate estimation results and, in particular, demonstrate the effectiveness of the adaptive residual training. Our approach provides a solution to the traditional "hidden formant problem" and produces meaningful results even during consonantal closures, when the supra-laryngeal source may cause no spectral prominences in speech acoustics.
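
As an illustration of the observation equation described above, the following minimal sketch maps resonance frequencies and bandwidths to LPC cepstra and linearizes that mapping around a reference point. The abstract does not state the predictive function, so the sketch assumes the standard all-pole relation c_n = (2/n) Σ_k exp(-π n b_k / f_s) cos(2π n f_k / f_s). The function names, the 8 kHz sampling rate, and the use of a single first-order Taylor expansion in the frequencies (rather than the paper's piecewise design with a trained residual) are assumptions made for illustration only.

```python
import math


def vtr_to_lpc_cepstrum(freqs_hz, bws_hz, order, fs_hz):
    """Predict LPC cepstra c_1..c_order from resonance frequencies/bandwidths.

    Assumed all-pole relation (not quoted from the abstract):
    c_n = (2/n) * sum_k exp(-pi*n*b_k/fs) * cos(2*pi*n*f_k/fs)
    """
    cep = []
    for n in range(1, order + 1):
        total = 0.0
        for f, b in zip(freqs_hz, bws_hz):
            decay = math.exp(-math.pi * n * b / fs_hz)
            total += decay * math.cos(2.0 * math.pi * n * f / fs_hz)
        cep.append(2.0 * total / n)
    return cep


def jacobian_wrt_freqs(freqs_hz, bws_hz, order, fs_hz):
    """Analytic partial derivatives d c_n / d f_k of the relation above."""
    jac = []
    for n in range(1, order + 1):
        row = []
        for f, b in zip(freqs_hz, bws_hz):
            decay = math.exp(-math.pi * n * b / fs_hz)
            slope = -(4.0 * math.pi / fs_hz) * decay
            row.append(slope * math.sin(2.0 * math.pi * n * f / fs_hz))
        jac.append(row)
    return jac


def linearized_cepstrum(freqs_hz, ref_freqs_hz, bws_hz, order, fs_hz):
    """First-order approximation of the predictor around ref_freqs_hz.

    Bandwidths are held fixed here for simplicity (an assumption of this sketch).
    """
    base = vtr_to_lpc_cepstrum(ref_freqs_hz, bws_hz, order, fs_hz)
    jac = jacobian_wrt_freqs(ref_freqs_hz, bws_hz, order, fs_hz)
    approx = []
    for n in range(order):
        delta = sum(
            jac[n][k] * (freqs_hz[k] - ref_freqs_hz[k])
            for k in range(len(freqs_hz))
        )
        approx.append(base[n] + delta)
    return approx


if __name__ == "__main__":
    fs = 8000.0  # assumed telephone-band sampling rate
    ref = [500.0, 1500.0, 2500.0]
    bws = [60.0, 90.0, 120.0]
    nearby = [520.0, 1480.0, 2510.0]
    exact = vtr_to_lpc_cepstrum(nearby, bws, 12, fs)
    approx = linearized_cepstrum(nearby, ref, bws, 12, fs)
    # The gap between the two is the linearization error that a residual
    # term would have to absorb in a scheme like the one described above.
    print(max(abs(e - a) for e, a in zip(exact, approx)))
```

The linearized form is what makes a standard Kalman filter/smoother applicable: the cepstral observation becomes an affine function of the hidden resonance state near the reference point, and whatever error remains is the quantity a residual term would model.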
