An Audiovisual Talking Head for Augmented Speech Generation: Models and Animations Based on a Real Speaker's Articulatory Data

We present a methodology for deriving three-dimensional models of speech articulators from volumetric MRI and multiple-view video images acquired from a single speaker. Linear component analysis is used to model these highly deformable articulators as the weighted sum of a small number of basic shapes corresponding to the articulators' degrees of freedom for speech. The models are assembled into an audiovisual talking head that can produce augmented audiovisual speech, i.e. display articulators that are normally hidden, such as the tongue or velum. The talking head is then animated by recovering its control parameters through inversion from the coordinates of a small number of points on the articulators of the same speaker, tracked by Electromagnetic Articulography (EMA). The augmented speech produced points the way to promising applications in speech therapy for children with delayed speech, perception and production rehabilitation of hearing-impaired children, and pronunciation training for second-language learners.
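
To make the two modeling steps concrete, the sketch below is a minimal, hypothetical illustration in Python/NumPy of (a) a linear articulatory model that expresses an articulator mesh as a mean shape plus a weighted sum of basis shapes extracted by linear component analysis, and (b) the inversion step that recovers the control parameters from the coordinates of a few tracked 3-D points, such as EMA coil positions. All function names, array layouts, and the use of SVD/PCA as the linear decomposition are assumptions for illustration, not the authors' actual implementation.

```python
import numpy as np

def fit_linear_model(meshes, n_components):
    """Fit a linear shape model to flattened articulator meshes.

    meshes: (n_samples, n_vertices * 3) array, one row per MRI-derived mesh.
    Returns the mean shape and the first n_components basis shapes.
    """
    mean = meshes.mean(axis=0)
    centered = meshes - mean
    # SVD of the centered data yields the principal shape components,
    # interpreted here as the articulator's degrees of freedom for speech.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]            # (n_components, n_vertices * 3)
    return mean, basis

def synthesize(mean, basis, weights):
    """Reconstruct a full mesh from control parameters (weights)."""
    return mean + weights @ basis

def invert_from_points(mean, basis, point_idx, observed):
    """Recover control parameters from a few observed coordinates.

    point_idx: indices into the flattened mesh for the tracked coordinates
               (e.g. the x, y, z entries corresponding to EMA coils).
    observed:  measured values at those indices.
    """
    # Restrict the model to the observed coordinates and solve the
    # resulting (typically small) linear least-squares problem.
    a = basis[:, point_idx].T            # (n_obs, n_components)
    b = observed - mean[point_idx]
    weights, *_ = np.linalg.lstsq(a, b, rcond=None)
    return weights

if __name__ == "__main__":
    # Toy demonstration with synthetic data standing in for MRI meshes.
    rng = np.random.default_rng(0)
    meshes = rng.normal(size=(50, 300))            # 50 frames, 100 vertices
    mean, basis = fit_linear_model(meshes, n_components=6)
    coil_idx = [0, 1, 2, 30, 31, 32, 60, 61, 62]   # x,y,z of 3 "EMA coils"
    target = synthesize(mean, basis, rng.normal(size=6))
    w = invert_from_points(mean, basis, coil_idx, target[coil_idx])
    print(np.allclose(synthesize(mean, basis, w)[coil_idx], target[coil_idx]))
```

Because the model is linear in its control parameters, the inversion reduces to an ordinary least-squares problem, which is what makes driving the full talking head from a handful of EMA coil positions tractable in real time.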
