An unsupervised method for learning to track tongue position from an acoustic signal.

A procedure for learning to recover the relative positions of the articulators from speech signals is demonstrated. The algorithm learns without supervision; that is, it does not require information about which articulator configurations created the acoustic signals in the training set. The procedure consists of vector quantizing short-time windows of a speech signal and then using multidimensional scaling to map quantization codes that were temporally close in the encoded speech signal to nearby points in a continuity map. Since temporally close sounds must have been produced by similar articulator configurations, sounds that were produced by similar articulator positions should be represented close to each other in the continuity map. Using an articulatory speech synthesizer to produce acoustic signals from known articulator positions, relative articulator positions were estimated from synthesized acoustic signals and compared to the synthesizer's actual articulator positions. High rank-order correl...
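The sketch below illustrates the two-stage idea described in the abstract, using scikit-learn in place of whatever quantizer and scaling routine the paper actually used. The window length, the log-spectrum features, the codebook size, and the lag-based co-occurrence dissimilarity are all illustrative assumptions, not the paper's settings; the point is only to show vector quantization followed by multidimensional scaling under a temporal-continuity constraint.

```python
# Minimal sketch of a continuity map: VQ of short-time windows, then MDS on a
# dissimilarity derived from temporal adjacency. Parameter choices are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import MDS

def frame_signal(signal, frame_len=256, hop=128):
    """Slice the waveform into short, overlapping windows."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])

def continuity_map(signal, n_codes=64, n_dims=2, max_lag=2):
    # 1. Short-time windows; a log-magnitude spectrum stands in for any spectral feature.
    frames = frame_signal(np.asarray(signal, dtype=float))
    feats = np.log(np.abs(np.fft.rfft(frames * np.hanning(frames.shape[1]))) + 1e-8)

    # 2. Vector quantization: each window is replaced by its nearest codebook index.
    codes = KMeans(n_clusters=n_codes, n_init=10, random_state=0).fit_predict(feats)

    # 3. Count how often pairs of codes occur within max_lag frames of each other;
    #    temporally close codes are assumed to reflect similar articulator configurations.
    cooc = np.zeros((n_codes, n_codes))
    for lag in range(1, max_lag + 1):
        for a, b in zip(codes[:-lag], codes[lag:]):
            cooc[a, b] += 1
            cooc[b, a] += 1

    # 4. Turn co-occurrence counts into dissimilarities and embed with MDS, so codes
    #    that are frequently adjacent in time land near each other in the continuity map.
    dissim = 1.0 / (1.0 + cooc)
    np.fill_diagonal(dissim, 0.0)
    mds = MDS(n_components=n_dims, dissimilarity="precomputed", random_state=0)
    return mds.fit_transform(dissim)
```

The returned array has one row per codebook entry; mapping each frame's code to its row gives a low-dimensional trajectory whose relative geometry can then be compared (e.g., by rank-order correlation) against known articulator positions.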
