Speech-driven facial animation using a hierarchical model

A system is presented that produces near video-realistic animation of a speaker from speech input alone. The audio input is a continuous speech signal that requires no phonetic labelling and is speaker-independent. The system needs only a short video training corpus of a subject speaking a list of viseme-targeted words to achieve convincing, realistic facial synthesis. It learns the natural mouth and face dynamics of a speaker, allowing new facial poses, unseen in the training video, to be synthesised. To achieve this, the authors developed a novel approach that uses a hierarchical, nonlinear principal components analysis (PCA) model coupling speech and appearance. Facial areas defined by the hierarchy are animated separately and merged in post-processing by an algorithm that combines texture and shape PCA data. The model is shown to synthesise videos of a speaker from new audio segments spoken by both previously heard and unheard speakers.
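As a rough illustration of the coupled speech-and-appearance PCA idea (a minimal sketch, not the authors' hierarchical nonlinear model), the code below fits a single joint PCA over concatenated audio and appearance feature vectors. A new audio frame is projected onto the audio sub-block of the joint basis, and the corresponding appearance parameters are reconstructed from the appearance sub-block. All function names, array shapes, and the least-squares projection step are illustrative assumptions.

```python
import numpy as np

def fit_joint_pca(audio_feats, appearance_params, n_components=8):
    """Fit a joint PCA over concatenated [audio | appearance] vectors.

    audio_feats:        (n_frames, d_audio) per-frame acoustic features
    appearance_params:  (n_frames, d_app) shape+texture model parameters
    (dimensions are illustrative, not taken from the paper)
    """
    X = np.hstack([audio_feats, appearance_params])
    mean = X.mean(axis=0)
    # Principal directions of the centred joint data via SVD.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    basis = Vt[:n_components]
    return mean, basis

def predict_appearance(audio, mean, basis, d_audio):
    """Estimate appearance parameters for a new audio frame.

    Solves (least-squares) for joint-PCA coefficients that match the
    audio sub-block, then reconstructs the appearance sub-block.
    """
    Ba, Bp = basis[:, :d_audio], basis[:, d_audio:]
    coeffs, *_ = np.linalg.lstsq(Ba.T, audio - mean[:d_audio], rcond=None)
    return mean[d_audio:] + coeffs @ Bp
```

In this simplified linear setting, audio frames whose appearance is a deterministic function of the acoustics are recovered exactly; the paper's hierarchical, nonlinear model additionally handles per-region animation and the merging of texture and shape PCA data.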
