Speech-driven facial animation using a hierarchical model

A system is presented that produces near video-realistic animation of a speaker from speech input alone. The audio input is a continuous speech signal that requires no phonetic labelling and is speaker-independent. The system needs only a short video training corpus of a subject speaking a list of viseme-targeted words to achieve convincing, realistic facial synthesis. It learns the natural mouth and face dynamics of a speaker, allowing new facial poses, unseen in the training video, to be synthesised. To achieve this, the authors developed a novel approach that uses a hierarchical, nonlinear principal components analysis (PCA) model coupling speech and appearance. Facial areas defined by the hierarchy are animated separately and merged in post-processing by an algorithm that combines texture and shape PCA data. The model is shown to synthesise videos of a speaker from new audio segments spoken by both previously heard and unheard speakers.
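As a rough illustration of the coupled speech-and-appearance PCA idea (a minimal sketch, not the authors' hierarchical nonlinear model), the code below fits a single joint PCA over concatenated audio and appearance feature vectors. A new audio frame is projected onto the audio sub-block of the joint basis, and the corresponding appearance parameters are reconstructed from the appearance sub-block. All function names, array shapes, and the least-squares projection step are illustrative assumptions.

```python
import numpy as np

def fit_joint_pca(audio_feats, appearance_params, n_components=8):
    """Fit a joint PCA over concatenated [audio | appearance] vectors.

    audio_feats:        (n_frames, d_audio) per-frame acoustic features
    appearance_params:  (n_frames, d_app) shape+texture model parameters
    (dimensions are illustrative, not taken from the paper)
    """
    X = np.hstack([audio_feats, appearance_params])
    mean = X.mean(axis=0)
    # Principal directions of the centred joint data via SVD.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    basis = Vt[:n_components]
    return mean, basis

def predict_appearance(audio, mean, basis, d_audio):
    """Estimate appearance parameters for a new audio frame.

    Solves (least-squares) for joint-PCA coefficients that match the
    audio sub-block, then reconstructs the appearance sub-block.
    """
    Ba, Bp = basis[:, :d_audio], basis[:, d_audio:]
    coeffs, *_ = np.linalg.lstsq(Ba.T, audio - mean[:d_audio], rcond=None)
    return mean[d_audio:] + coeffs @ Bp
```

In this simplified linear setting, audio frames whose appearance is a deterministic function of the acoustics are recovered exactly; the paper's hierarchical, nonlinear model additionally handles per-region animation and the merging of texture and shape PCA data.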
