Paper information - Lip-Synching Using Speaker-Specific Articulation, Shape and Appearance Models

We describe the control, shape and appearance models built using an original photogrammetric method to capture the characteristics of speaker-specific facial articulation, anatomy, and texture. Two original contributions are put forward: a trainable trajectory formation model that predicts the articulatory trajectories of a talking face from phonetic input, and a texture model that computes a texture for each 3D facial shape according to articulation. Using motion capture data from different speakers and module-specific evaluation procedures, we show that this cloning system restores both the detailed idiosyncrasies and the global coherence of visible articulation. Results of a subjective evaluation of the global system with competing trajectory formation models are also presented and discussed.

Gérard Bailly | Frédéric Elisei | Gaspard Breton | Oxana Govokhina
