Automatic animation of an articulatory tongue model from ultrasound images using Gaussian mixture regression

This paper presents a method for automatically animating the articulatory tongue model of a reference speaker from ultrasound images of another speaker's tongue. The work is developed in the context of speech therapy based on visual biofeedback, in which a speaker is provided with visual information about his/her own articulation. In our approach, the feedback is delivered via an articulatory talking head that displays the tongue during speech production using augmented reality (e.g., transparent skin). The user's tongue movements are captured with ultrasound imaging and parameterized using the PCA-based EigenTongue technique. The extracted features are then converted into control parameters of the articulatory tongue model using Gaussian mixture regression (GMR). This procedure was evaluated by decoding the converted tongue movements at the phonetic level with an HMM-based decoder trained on the reference speaker's articulatory data. Decoding errors were then manually reassessed to account for possible phonetic idiosyncrasies (i.e., speaker- or phoneme-specific articulatory strategies). With a system trained on a limited set of 88 VCV sequences, recognition accuracy at the phonetic level was approximately 70%.

Index Terms: articulatory tongue model, articulatory talking head, ultrasound imaging, GMM, speech therapy
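To make the feature-extraction step concrete, the following is a minimal sketch of EigenTongue-style parameterization: PCA is fitted on vectorized ultrasound frames and each frame is projected onto the leading components, which then serve as the frame's feature vector. The frame dimensions, component count, and all variable names are illustrative assumptions, not taken from the paper's implementation.

```python
# Minimal EigenTongue-style sketch: PCA over raw ultrasound pixels.
# Frame size and number of components are assumptions for illustration.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
frames = rng.random((500, 64, 64))          # stand-in for ultrasound frames
X = frames.reshape(len(frames), -1)         # one pixel row vector per frame
eigentongue = PCA(n_components=30).fit(X)   # "EigenTongue" basis
features = eigentongue.transform(X)         # (N, 30) per-frame features
```

In practice the frames would first be denoised (e.g., by speckle-reducing filtering) and cropped to the region of interest before applying PCA.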
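The core cross-speaker mapping can be sketched as joint-density Gaussian mixture regression: a GMM is trained on time-aligned pairs of source EigenTongue features and target articulatory control parameters, and at conversion time the target is estimated as the conditional expectation E[y | x] under the joint model. This is a generic GMR sketch that assumes pre-aligned training pairs (e.g., via DTW); the function and variable names are hypothetical.

```python
# Minimal joint-density GMR sketch, assuming time-aligned source features X
# and target articulatory parameters Y. Names are illustrative assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(X, Y, n_components=32, seed=0):
    """Fit a GMM on the joint vectors z_t = [x_t; y_t]."""
    Z = np.hstack([X, Y])
    return GaussianMixture(n_components=n_components,
                           covariance_type="full",
                           random_state=seed).fit(Z)

def gmr_predict(gmm, X, dx):
    """Map source features X to E[y | x]; dx is the source dimensionality."""
    means, covs, weights = gmm.means_, gmm.covariances_, gmm.weights_
    K, dy = len(weights), means.shape[1] - dx
    preds = np.zeros((len(X), dy))
    for t, x in enumerate(X):
        log_resp = np.empty(K)
        cond_means = np.empty((K, dy))
        for k in range(K):
            mu_x, mu_y = means[k, :dx], means[k, dx:]
            S_xx, S_yx = covs[k, :dx, :dx], covs[k, dx:, :dx]
            diff = x - mu_x
            sol = np.linalg.solve(S_xx, diff)
            # Component responsibility for x alone, up to a shared constant.
            _, logdet = np.linalg.slogdet(S_xx)
            log_resp[k] = np.log(weights[k]) - 0.5 * (logdet + diff @ sol)
            # Conditional mean: mu_y + S_yx S_xx^{-1} (x - mu_x).
            cond_means[k] = mu_y + S_yx @ sol
        resp = np.exp(log_resp - log_resp.max())
        resp /= resp.sum()
        preds[t] = resp @ cond_means   # responsibility-weighted average
    return preds
```

A common refinement, not shown here, is to append dynamic (delta) features and use maximum-likelihood parameter generation instead of the frame-wise conditional mean, which yields smoother converted trajectories.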
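The evaluation idea can likewise be approximated with per-phone HMMs: models are trained on the reference speaker's articulatory trajectories, and a converted segment is labeled with the phone whose model scores it highest. This isolated-unit sketch (using hmmlearn) is a simplification of the paper's decoder, and all names and settings are assumptions.

```python
# Simplified stand-in for the HMM-based phonetic evaluation:
# one left-to-right-style GaussianHMM per phone, maximum-likelihood labeling.
import numpy as np
from hmmlearn import hmm

def train_phone_hmms(segments_by_phone, n_states=3, seed=0):
    """segments_by_phone: dict mapping phone label -> list of (T_i, D) arrays."""
    models = {}
    for phone, segs in segments_by_phone.items():
        X = np.vstack(segs)                 # concatenated training frames
        lengths = [len(s) for s in segs]    # segment boundaries for hmmlearn
        m = hmm.GaussianHMM(n_components=n_states,
                            covariance_type="diag",
                            random_state=seed)
        m.fit(X, lengths)
        models[phone] = m
    return models

def decode_segment(models, segment):
    """Return the phone whose HMM assigns the segment the highest likelihood."""
    return max(models, key=lambda p: models[p].score(segment))
```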
