Control of speech-related facial movements of an avatar from video

Several puppetry techniques have recently been proposed to transfer emotional facial expressions from a user's video to an avatar. Whereas the generation of facial expressions may not be sensitive to small tracking errors, the generation of speech-related facial movements would be severely impaired by them. Since incongruent facial movements can drastically influence speech perception, we proposed a more effective method to transfer speech-related facial movements from a user to an avatar. After a facial tracking phase, the speech articulatory parameters controlling the jaw and the lips were determined from the set of landmark positions. Two additional processes computed the articulatory parameters controlling the eyelids and the tongue from the 2D Discrete Cosine Transform coefficients of the eye and inner-mouth images. A speech-in-noise perception experiment was conducted with 25 participants to evaluate the system. An increase in intelligibility was shown for the avatar and human auditory-visual conditions compared with the avatar and human auditory-only conditions, respectively. The results of the avatar auditory-visual presentation differed with the vocalic context: all consonants were better perceived in the /a/ vocalic context than in /i/ and /u/, because of the lack of depth information recoverable from the video. This method could be used to accurately animate avatars for hearing-impaired people using information and telecommunication technologies.
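
The abstract does not give implementation details, but the 2D-DCT feature extraction step can be illustrated in outline. The sketch below (Python with NumPy/SciPy) computes the low-frequency 2D Discrete Cosine Transform coefficients of a cropped eye or inner-mouth patch and maps them to a single control parameter through a regression vector that would have to be learned offline. The function names, the 6×6 coefficient block, and the linear mapping are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.fftpack import dct


def dct2(patch):
    """Type-II 2D DCT with orthonormal scaling (DCT along rows, then columns)."""
    return dct(dct(patch, axis=0, norm='ortho'), axis=1, norm='ortho')


def dct_features(patch, k=6):
    """Keep the k x k low-frequency DCT coefficients of a grayscale patch as a
    feature vector (k = 6 is an illustrative choice; the paper does not state it)."""
    coeffs = dct2(patch.astype(np.float64))
    return coeffs[:k, :k].ravel()


def articulatory_parameter(patch, weights, bias=0.0):
    """Hypothetical linear mapping from DCT features to one articulatory
    parameter (e.g. eyelid aperture); 'weights' would be learned offline."""
    return float(np.dot(weights, dct_features(patch)) + bias)


if __name__ == "__main__":
    # Synthetic 32x32 grayscale patch standing in for a cropped eye region.
    rng = np.random.default_rng(0)
    patch = rng.uniform(0.0, 255.0, size=(32, 32))
    w = rng.normal(size=36)  # 6 * 6 = 36 DCT features
    print(articulatory_parameter(patch, w))
```

Low-frequency DCT coefficients are a compact appearance descriptor that is relatively robust to pixel noise, which is presumably why they are preferred here over raw pixels for regions such as the eyes and inner mouth, where landmark tracking alone is unreliable.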
