VIDEO REALISTIC TALKING HEADS USING HIERARCHICAL NON-LINEAR SPEECH-APPEARANCE MODELS

In this paper we present an audio-driven system capable of videorealistic synthesis of a speaker uttering novel phrases. The audio input signal requires no phonetic labelling and is speaker independent. The system requires only a small video training set and produces fully co-articulated, realistic facial synthesis. Natural mouth and face dynamics are learned during training, allowing new facial poses, unseen in the training video, to be rendered. To improve specificity and synthesis quality, the appearance of the speaker's mouth and face is modelled separately and the two are combined to produce the final video. To achieve this we have developed a novel approach that uses a hierarchical, non-linear PCA model coupling speech and appearance. The model is highly compact, making it suitable for a wide range of real-time multimedia and telecommunications applications on standard hardware.
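To make the coupled speech-appearance idea concrete, the following is a minimal sketch, assuming synthetic data and arbitrary feature dimensions, of a single joint PCA over concatenated per-frame speech and appearance vectors. The `synthesize_appearance` helper, the 95% variance threshold, and the least-squares projection are illustrative assumptions rather than the paper's method, and the sketch omits both the hierarchy (separate mouth and face models) and the non-linearity described above.

```python
import numpy as np

# Sketch of a joint (coupled) speech-appearance PCA.
# Assumption: synthetic Gaussian data stands in for real per-frame
# audio features and appearance-model parameters; dimensions are arbitrary.
rng = np.random.default_rng(0)
n_frames, speech_dim, appearance_dim = 500, 12, 20
speech = rng.normal(size=(n_frames, speech_dim))          # per-frame audio features
appearance = rng.normal(size=(n_frames, appearance_dim))  # per-frame appearance parameters

# Concatenate the two modalities so the PCA captures their joint covariance.
joint = np.hstack([speech, appearance])
mean = joint.mean(axis=0)
centered = joint - mean

# PCA via SVD; keep enough components to explain ~95% of the variance.
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
explained = np.cumsum(S**2) / np.sum(S**2)
k = int(np.searchsorted(explained, 0.95)) + 1
components = Vt[:k]  # shape (k, speech_dim + appearance_dim)

def synthesize_appearance(speech_frame):
    """Estimate appearance parameters for a new speech frame by fitting
    joint-model coefficients to the speech block alone (least squares),
    then reading off the appearance block. Illustrative only."""
    B_speech = components[:, :speech_dim]   # speech block of the joint basis
    B_app = components[:, speech_dim:]      # appearance block of the joint basis
    target = speech_frame - mean[:speech_dim]
    coeffs, *_ = np.linalg.lstsq(B_speech.T, target, rcond=None)
    return mean[speech_dim:] + coeffs @ B_app

new_appearance = synthesize_appearance(rng.normal(size=speech_dim))
print(new_appearance.shape)  # (appearance_dim,)
```

The point of the concatenation is that the principal components then encode the joint covariance of the two modalities, so matching the speech block of a new frame also determines a plausible appearance block; in practice the synthetic arrays would be replaced with real audio and appearance-model features.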
