A method to combine acoustic and morphological constraints in the speech production inverse problem

Abstract This paper addresses the inverse problem of the articulatory-to-acoustic mapping in speech production. A framework based on an explicit combination of vocal-tract morphological and acoustic constraints is proposed. The solution is based on a Fourier analysis of the vocal-tract log-area function: the relationship between the log-area Fourier cosine coefficients and the corresponding formant frequencies is used to formulate an acoustic constraint. The same set of coefficients is then used to express a morphological constraint. Representing both acoustic and morphological constraints in the same parameter space allows an efficient solution of the inverse problem. The basis of the acoustic constraint formulation was first proposed by Mermelstein (1967); at that time, however, it was not combined with morphological constraints. The method is tested on a set of vowels. The results confirm its validity, but they also show the need for dynamic constraints.
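
As a rough illustration of the acoustic constraint described above (a minimal sketch under idealized lossless-tube assumptions, not the paper's own formulation), the log-area function can be expanded in a Fourier cosine series, and a first-order perturbation of a uniform tube closed at the glottis and open at the lips, in the spirit of Mermelstein (1967) and Schroeder (1967), ties each formant to a single coefficient. The symbols A(x), A_0, L, a_m and F_n are illustrative choices, not notation taken from the paper:

\[ \log\frac{A(x)}{A_0} \;=\; \sum_{m=1}^{M} a_m \cos\frac{m\pi x}{L}, \qquad 0 \le x \le L, \]

\[ \frac{\Delta F_n}{F_n} \;\propto\; a_{2n-1} \quad \text{(to first order)}, \qquad F_n^{(0)} = \frac{(2n-1)\,c}{4L}. \]

Under these assumptions, only the odd-indexed coefficients a_1, a_3, a_5, ... perturb the formants to first order, so each measured formant constrains roughly one coefficient, while the even-indexed coefficients span an acoustically near-invisible subspace. This is the kind of ambiguity that a morphological constraint, expressed over the same set of coefficients, is intended to resolve.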

[1] A. Marchal, et al., Speech production and speech modelling, 1990.

[2] B. Atal, et al., Inversion of articulatory-to-acoustic transformation in the vocal tract by a computer-sorting technique, 1978, The Journal of the Acoustical Society of America.

[3] Man Mohan Sondhi, et al., A hybrid time-frequency domain articulatory speech synthesizer, 1987, IEEE Trans. Acoust. Speech Signal Process.

[4] J. Flanagan, A Difference Limen for Vowel Formant Frequency, 1955.

[5] Biing-Hwang Juang, et al., Fundamentals of speech recognition, 1993, Prentice Hall signal processing series.

[6] A. G. Webster, et al., Acoustical Impedance and the Theory of Horns and of the Phonograph, 1919, Proceedings of the National Academy of Sciences of the United States of America.

[7] P. Ladefoged, et al., Factor analysis of tongue shapes, 1971, Journal of the Acoustical Society of America.

[8] Peter E. Hart, et al., Pattern classification and scene analysis, 1974, A Wiley-Interscience publication.

[9] Man Mohan Sondhi, et al., Estimation of vocal-tract areas: The need for acoustical measurements, 1979.

[10] P. Ladefoged, et al., Generating vocal tract shapes from formant frequencies, 1978, The Journal of the Acoustical Society of America.

[11] J. Perkell, Physiology of speech production: results and implications of a quantitative cineradiographic study, 1969.

[12] J. R. Resnick, et al., The inverse problem for the vocal tract: numerical methods, acoustical experiments, and speech synthesis, 1983, The Journal of the Acoustical Society of America.

[13] G. Fant, et al., The Relations between Area Functions and the Acoustic Signal, 1980, Phonetica.

[14] Peter Noll, et al., Digital Coding of Waveforms, 1986.

[15] Shinji Maeda, et al., A digital simulation method of the vocal-tract system, 1982, Speech Commun.

[16] Richard O. Duda, et al., Pattern classification and scene analysis, 1974, A Wiley-Interscience publication.

[17] M. Schroeder, Determination of the geometry of the human vocal tract by acoustic measurements, 1967, The Journal of the Acoustical Society of America.

[18] Hani Yehia, et al., Determination of human vocal-tract dynamic geometry from formant trajectories using spatial and temporal Fourier analysis, 1994, Proceedings of ICASSP '94, IEEE International Conference on Acoustics, Speech and Signal Processing.

[19] Shinji Maeda, et al., Compensatory Articulation During Speech: Evidence from the Analysis and Synthesis of Vocal-Tract Shapes Using an Articulatory Model, 1990.

[20] P. Mermelstein, Determination of the vocal-tract shape from measured formant frequencies, 1967, The Journal of the Acoustical Society of America.

[21] Katsuhiko Shirai, et al., Estimating articulatory motion from speech wave, 1986, Speech Commun.

[22] J. Flanagan, Speech Analysis, Synthesis and Perception, 1971.

[23] John Nicholas Holmes, et al., Speech synthesis, 1972.

[24] Juergen Schroeter, et al., Pitch-synchronous frame-by-frame and segment-based articulatory analysis by synthesis, 1993.

[25] Katsuhiko Shirai, et al., Estimation and generation of articulatory motion using neural networks, 1993, Speech Commun.

[26] T. H. Crystal, et al., Linear prediction of speech, 1977, Proceedings of the IEEE.

[27] Man Mohan Sondhi, et al., Techniques for estimating vocal-tract shapes from the speech signal, 1994, IEEE Trans. Speech Audio Process.

[28] Juergen Schroeter, et al., Speech coding based on physiological models of speech production, 1992.

[29] A. Paige, et al., Calculation of vocal tract length, 1970.

[30] E. Eisner, et al., Complete Solutions of the “Webster” Horn Equation, 1967.

[31] Gérard Bailly, et al., Formant trajectories as audible gestures: An alternative for speech synthesis, 1991.

[32] Hisashi Wakita, Direct Estimation of the Vocal Tract Shape by Inverse Filtering of Acoustic Speech Waveforms, 1973.

[33] Ilse Lehiste, et al., Readings in Acoustic Phonetics, 1968.

[34] Richard S. McGowan, Recovering articulatory movement from formant frequency trajectories using task dynamics and a genetic algorithm: Preliminary model tests, 1994, Speech Commun.

[35] M. Sondhi, Model for wave propagation in a lossy vocal tract, 1974, The Journal of the Acoustical Society of America.

[36] Charles R. Johnson, et al., Matrix analysis, 1985.

[37] J. Flanagan, et al., Signal models for low bit-rate coding of speech, 1980.

[38] M. G. Bellanger, et al., Digital processing of speech signals, 1980, Proceedings of the IEEE.

[39] H. Wakita, Estimation of vocal-tract shapes from acoustical analysis of the speech wave: The state of the art, 1979.

[40] Michael I. Jordan, Motor Learning and the Degrees of Freedom Problem, 1990, Attention and Performance XIII.