Speech parameter estimation using a vocal tract/cord model

This paper proposes the use of a vocal cord and vocal tract model for speech coding at bit rates below 4.8 kb/s. A key requirement is the ability to derive the model parameters from an input speech signal. Our approach to this problem employs an acoustic analysis front-end, a linked codebook of vocal-tract configurations and their related acoustic characteristics, and an optimizing articulatory synthesizer. While the acoustic front-end is relatively straightforward, involving LPC, pitch, and voicing analyses, the codebook design and usage, as well as the specific method for optimizing the model parameters, are new. The codebook is intended to provide good starting values for an iterative optimization, thus alleviating the problem of locking onto a locally optimal solution. In a first stage of optimization, the best vocal tract configuration found in the codebook is refined by varying only the vocal tract parameters. Then, in a second stage of optimization, the best match is found between the glottal waveform of the model and the inverse-filtered input speech.
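The codebook-then-refine idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `toy_synth` is a hypothetical two-parameter stand-in for an articulatory synthesizer, the squared-error `distance` is a placeholder for whatever acoustic measure is actually used, and the coordinate search in `refine` is a generic local optimizer, not the paper's method. Only the codebook lookup and the first optimization stage are sketched; the second stage (glottal waveform matching) is not covered.

```python
import math

def toy_synth(params):
    """Hypothetical stand-in for an articulatory synthesizer: maps two
    'tract' parameters to two 'acoustic' features (not the paper's model)."""
    a, b = params
    return (a + 0.3 * math.sin(b), b + 0.3 * a * a)

def distance(u, v):
    """Squared Euclidean distance between two feature vectors (assumed measure)."""
    return sum((x - y) ** 2 for x, y in zip(u, v))

def build_codebook(grid):
    """Linked codebook: each entry pairs a configuration with its acoustics."""
    return [((a, b), toy_synth((a, b))) for a in grid for b in grid]

def codebook_lookup(codebook, target):
    """Return the configuration whose stored acoustics are closest to target."""
    return min(codebook, key=lambda entry: distance(entry[1], target))[0]

def refine(start, target, step=0.25, min_step=1e-6):
    """Coordinate search: try +/- step on each parameter and keep any
    improvement; halve the step when no trial improves the error."""
    best = list(start)
    best_err = distance(toy_synth(best), target)
    while step > min_step:
        improved = False
        for i in range(len(best)):
            for delta in (step, -step):
                trial = list(best)
                trial[i] += delta
                err = distance(toy_synth(trial), target)
                if err < best_err:
                    best, best_err, improved = trial, err, True
        if not improved:
            step /= 2
    return tuple(best), best_err

def estimate(target, codebook):
    """Codebook lookup for a starting point, then local refinement."""
    start = codebook_lookup(codebook, target)
    return refine(start, target)

if __name__ == "__main__":
    codebook = build_codebook([-1.0, -0.5, 0.0, 0.5, 1.0])
    target = toy_synth((0.7, -0.3))  # acoustics of a configuration not in the codebook
    params, err = estimate(target, codebook)
    print(params, err)
```

In this toy setting the codebook only has to land the search near the right configuration; the local search then closes the remaining gap. With a realistic articulatory-to-acoustic mapping, which is many-to-one, the quality of that starting point matters far more than it does here.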
