A structured speech model with continuous hidden dynamics and prediction-residual training for tracking vocal tract resonances

A novel approach is developed for efficient and accurate tracking of vocal tract resonances, the natural frequencies of the resonator extending from the larynx to the lips, in fluent speech. The tracking algorithm is based on a version of the structured speech model that consists of continuous-valued hidden dynamics and a piecewise-linearized prediction function mapping resonance frequencies and bandwidths to LPC cepstra. We present details of the piecewise linearization design process and an adaptive training technique for the parameters that characterize the prediction residuals. We then describe and evaluate an iterative tracking algorithm that embeds both the prediction-residual training and the piecewise linearization design in an adaptive Kalman filtering framework. Experiments on Switchboard speech data demonstrate high tracking accuracy and confirm the effectiveness of the residual training embedded in the algorithm. Our approach differs from traditional formant trackers in that it provides meaningful results even during consonantal closures, when the supra-laryngeal source may cause no spectral prominences in speech acoustics.
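
To make the prediction function concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the standard all-pole relation between resonances and LPC cepstra, in which each resonance with frequency f_k and bandwidth b_k contributes (2/n) exp(-pi n b_k / f_s) cos(2 pi n f_k / f_s) to the n-th cepstral coefficient. The function names, the 8 kHz sampling rate, and the cepstral order of 12 are illustrative assumptions, and the linearization shown is a plain first-order expansion in frequency around a single point, which is only one possible ingredient of a piecewise design.

```python
import math


def vtr_to_lpc_cepstra(freqs_hz, bandwidths_hz, n_cepstra=12, sample_rate_hz=8000.0):
    """Predict LPC cepstra c_1..c_N from resonance frequencies and bandwidths.

    Assumes an all-pole vocal tract model (an assumption of this sketch).
    """
    cepstra = []
    for n in range(1, n_cepstra + 1):
        total = 0.0
        for f, b in zip(freqs_hz, bandwidths_hz):
            decay = math.exp(-math.pi * n * b / sample_rate_hz)
            total += decay * math.cos(2.0 * math.pi * n * f / sample_rate_hz)
        cepstra.append(2.0 / n * total)
    return cepstra


def linearize_in_frequency(freqs_hz, bandwidths_hz, n_cepstra=12, sample_rate_hz=8000.0):
    """Return (c0, jac): cepstra at the expansion point and d c_n / d f_k.

    jac[n - 1][k] holds the analytical derivative of c_n with respect to f_k.
    """
    c0 = vtr_to_lpc_cepstra(freqs_hz, bandwidths_hz, n_cepstra, sample_rate_hz)
    jac = []
    for n in range(1, n_cepstra + 1):
        row = []
        for f, b in zip(freqs_hz, bandwidths_hz):
            decay = math.exp(-math.pi * n * b / sample_rate_hz)
            angle = 2.0 * math.pi * n * f / sample_rate_hz
            row.append(-(4.0 * math.pi / sample_rate_hz) * decay * math.sin(angle))
        jac.append(row)
    return c0, jac


def predict_linearized(c0, jac, freq_offsets_hz):
    """Evaluate the first-order approximation c0 + jac @ (f - f0)."""
    return [
        c + sum(j * d for j, d in zip(row, freq_offsets_hz))
        for c, row in zip(c0, jac)
    ]
```

A linear (or piecewise-linear) observation function of this kind is what allows a Kalman filter to be applied to the hidden resonance trajectory; the mismatch between the predicted and observed cepstra is the prediction residual whose parameters the abstract describes training adaptively.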
