Creating a Speech Enabled Avatar from a Single Photograph

This paper presents a complete framework for creating a speech-enabled avatar from a single image of a person. Our approach uses a generic facial motion model that represents the deformations of a prototype face during speech. We have developed an HMM-based facial animation algorithm that takes both lexical stress and coarticulation into account and produces realistic animations of the prototype facial surface from either text or speech. The generic facial motion model can be transferred to a novel face geometry using a set of corresponding points between the prototype face surface and the novel face. Given a face photograph, a small number of manually selected features in the photograph are used to deform the prototype face surface, and the deformed surface is then used to animate the face in the photograph. We show several examples of avatars driven by text and speech inputs.
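
The transfer step above hinges on a sparse set of corresponding feature points between the prototype face and the novel face. As one way to make that concrete, the sketch below fits a 2D thin-plate spline that maps prototype feature positions onto photograph feature positions and then evaluates it at arbitrary points. This is a minimal illustration under assumptions of our own: the abstract does not specify the warp, the paper actually deforms a 3D prototype surface rather than a 2D point set, and the function names (fit_tps, warp_tps) and example coordinates are hypothetical.

```python
import numpy as np

def _tps_kernel(r):
    # Thin-plate spline radial basis U(r) = r^2 * log(r), with U(0) defined as 0.
    out = np.zeros_like(r)
    nz = r > 0
    out[nz] = (r[nz] ** 2) * np.log(r[nz])
    return out

def fit_tps(src, dst):
    """Fit a 2D thin-plate spline mapping src control points onto dst.

    src, dst: (n, 2) arrays of corresponding feature positions
    (e.g. eye corners, nose tip, mouth corners).
    Returns an (n + 3, 2) coefficient matrix: n radial weights plus an affine part,
    one column per output coordinate.
    """
    n = src.shape[0]
    dists = np.linalg.norm(src[:, None, :] - src[None, :, :], axis=-1)
    K = _tps_kernel(dists)                      # (n, n) radial terms
    P = np.hstack([np.ones((n, 1)), src])       # (n, 3) affine terms
    A = np.zeros((n + 3, n + 3))
    A[:n, :n], A[:n, n:], A[n:, :n] = K, P, P.T
    b = np.zeros((n + 3, 2))
    b[:n] = dst
    return np.linalg.solve(A, b)

def warp_tps(points, src, coeffs):
    """Evaluate the fitted spline at arbitrary points of shape (m, 2)."""
    dists = np.linalg.norm(points[:, None, :] - src[None, :, :], axis=-1)
    K = _tps_kernel(dists)                      # (m, n)
    P = np.hstack([np.ones((len(points), 1)), points])
    return K @ coeffs[:-3] + P @ coeffs[-3:]

# Illustrative example: five hand-picked correspondences between prototype and photo.
proto = np.array([[30., 40.], [70., 40.], [50., 60.], [35., 80.], [65., 80.]])
photo = np.array([[28., 42.], [72., 38.], [50., 63.], [33., 84.], [68., 82.]])
coeffs = fit_tps(proto, photo)
print(warp_tps(proto, proto, coeffs))           # reproduces the photo features exactly
```

In a full pipeline, the same fitted warp would be evaluated at every projected prototype vertex to pull the whole surface into alignment with the photograph; regularization and degenerate (collinear or duplicated) control-point configurations are ignored here for brevity.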
