Lip-Synching Using Speaker-Specific Articulation, Shape and Appearance Models

We describe the control, shape and appearance models built with an original photogrammetric method to capture the characteristics of speaker-specific facial articulation, anatomy, and texture. Two original contributions are put forward: a trainable trajectory formation model that predicts the articulatory trajectories of a talking face from phonetic input, and a texture model that computes a texture for each 3D facial shape according to its articulation. Using motion-capture data from several speakers and module-specific evaluation procedures, we show that this cloning system restores both detailed idiosyncrasies and the global coherence of visible articulation. Results of a subjective evaluation of the complete system against competing trajectory formation models are also presented and discussed.
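
The abstract does not spell out the trajectory formation algorithm, but "trainable trajectory formation from phonetic input" is commonly realized as HMM-based maximum-likelihood parameter generation in the sense of Tokuda et al. As a rough, non-authoritative sketch of what such a model computes, the code below turns per-frame Gaussian targets over one articulatory parameter and its delta into a single smooth trajectory by weighted least squares. The delta window, the one-dimensional setup and the toy phone targets are illustrative assumptions, not the authors' actual model.

```python
import numpy as np

def mlpg(mu, var, delta_win=(-0.5, 0.0, 0.5)):
    """Maximum-likelihood parameter generation (Tokuda-style sketch).

    mu, var : (T, 2) arrays of per-frame Gaussian means and variances
    for one articulatory parameter (column 0) and its delta (column 1),
    e.g. emitted by HMM states aligned with the phonetic input.
    Returns the static trajectory c (length T) that maximises the
    likelihood of the stacked observation [c; delta(c)].
    """
    T = mu.shape[0]
    # W maps the static trajectory to the stacked [static; delta] vector.
    W = np.zeros((2 * T, T))
    W[:T] = np.eye(T)
    for t in range(T):
        for k, w in zip((-1, 0, 1), delta_win):
            if 0 <= t + k < T and w != 0.0:
                W[T + t, t + k] = w
    mu_vec = np.concatenate([mu[:, 0], mu[:, 1]])
    prec = np.concatenate([1.0 / var[:, 0], 1.0 / var[:, 1]])  # Sigma^-1
    A = W.T @ (prec[:, None] * W)   # W' Sigma^-1 W
    b = W.T @ (prec * mu_vec)       # W' Sigma^-1 mu
    return np.linalg.solve(A, b)    # weighted least-squares solution

# Toy usage: three "phones" of 5 frames each, with static targets
# 0.0 -> 1.0 -> 0.2 and near-zero delta targets favouring smoothness.
targets = np.repeat([0.0, 1.0, 0.2], 5)
mu = np.stack([targets, np.zeros_like(targets)], axis=1)
var = np.stack([np.full(15, 0.1), np.full(15, 0.01)], axis=1)
print(np.round(mlpg(mu, var), 3))   # a smooth, coarticulated path
```

The delta targets are what produce coarticulation-like behaviour: tight variances on the deltas pull the static trajectory toward a smooth path through the successive phone targets instead of a step function.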
