Techniques for estimating vocal-tract shapes from the speech signal

This paper reviews methods for mapping from the acoustical properties of a speech signal to the geometry of the vocal tract that generated the signal. Such mapping techniques are studied for their potential application in speech synthesis, coding, and recognition. Mathematically, the estimation of the vocal tract shape from its output speech is a so-called inverse problem, where the direct problem is the synthesis of speech from a given time-varying geometry of the vocal tract and glottis. Different mappings are discussed: mapping via articulatory codebooks, mapping by nonlinear regression, mapping by basis functions, and mapping by neural networks. Besides being nonlinear, the acoustic-to-geometry mapping is also nonunique, i.e., more than one tract geometry might produce the same speech spectrum. The authors show how this nonuniqueness can be alleviated by imposing continuity constraints.
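To make the role of continuity constraints concrete, the following is a minimal sketch, not the authors' method. It assumes a toy articulatory codebook of (shape, acoustic) pairs with one-dimensional vectors, a squared-Euclidean distance, and a hypothetical function name `smooth_codebook_search` with a hypothetical `weight` parameter. Two codebook entries deliberately share the same acoustic vector, so a frame-by-frame lookup cannot tell them apart. A dynamic-programming search that also penalizes jumps in shape between consecutive frames selects the entry closest to its neighbors.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smooth_codebook_search(frames, codebook, weight=1.0):
    """Pick one codebook shape per acoustic frame.

    Minimizes the summed acoustic mismatch plus `weight` times the squared
    jump in shape between consecutive frames (a Viterbi-style search).
    `codebook` is a list of (shape, acoustic) pairs; all names here are
    illustrative assumptions.
    """
    n = len(codebook)
    # cost[j]: best cumulative cost of any path ending at entry j
    cost = [sq_dist(frames[0], acoustic) for _, acoustic in codebook]
    back = []  # back[t][j]: best predecessor of entry j at frame t + 1
    for frame in frames[1:]:
        new_cost, pointers = [], []
        for shape_j, acoustic_j in codebook:
            # Cost of arriving at this entry from each possible predecessor
            arrive = [cost[i] + weight * sq_dist(codebook[i][0], shape_j)
                      for i in range(n)]
            best_i = min(range(n), key=arrive.__getitem__)
            new_cost.append(arrive[best_i] + sq_dist(frame, acoustic_j))
            pointers.append(best_i)
        cost = new_cost
        back.append(pointers)
    # Trace the cheapest path backwards from the last frame
    j = min(range(n), key=cost.__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [codebook[j][0] for j in path]


# Toy codebook: the first two shapes produce identical acoustics.
codebook = [
    ((10.0,), (1.0,)),
    ((0.0,), (1.0,)),
    ((1.0,), (2.0,)),
]
frames = [(2.0,), (1.0,), (2.0,)]

greedy = smooth_codebook_search(frames, codebook, weight=0.0)
smooth = smooth_codebook_search(frames, codebook, weight=0.1)
print(greedy)
print(smooth)
```

With `weight=0.0` the search reduces to independent nearest-neighbor lookups, and the ambiguous middle frame is resolved arbitrarily by codebook order. With a positive weight, the middle frame is assigned the shape that lies close to the shapes chosen for the neighboring frames.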
