A speech-driven talking head system based on a single face image

This paper proposes a lifelike talking head system. The talking head is driven by speaker-independent speech recognition and requires only a single face image to synthesize lifelike facial expressions. The system uses speech recognition engines to extract the utterances in the speech data together with their time stamps. The facial expression associated with each utterance is then fetched from an expression pool, and the synthesized expressions are synchronized with the speech using those time stamps. When deployed on the Internet, our web-enabled talking head system can serve as a vivid merchandise narrator. It requires only 50 Kbytes/minute, plus a one-time transfer of the face image (about 40 Kbytes in CIF format, 24-bit color, JPEG compression). The system can synthesize facial animation at more than 30 frames/sec on a Pentium II 266 MHz PC.
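The synchronization step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the segment format `(label, start_sec, end_sec)`, the `EXPRESSION_POOL` table, and the single "mouth opening" value per label are all hypothetical stand-ins for whatever the recognizer and expression pool actually provide. The sketch only shows how time stamps can select one pool entry per video frame.

```python
from bisect import bisect_right

# Hypothetical expression pool: utterance label -> mouth-opening value in [0, 1].
# The real system stores facial expressions; one number is used here for brevity.
EXPRESSION_POOL = {"sil": 0.0, "m": 0.1, "o": 0.6, "a": 0.9}


def frame_labels(segments, fps=30):
    """Return the utterance label active at each video frame.

    segments: list of (label, start_sec, end_sec) tuples sorted by start time,
    an assumed format for the recognizer's time-stamped output.
    """
    if not segments:
        return []
    starts = [start for _, start, _ in segments]
    total_frames = int(round(segments[-1][2] * fps))
    labels = []
    for i in range(total_frames):
        t = i / fps
        # Index of the last segment whose start time is <= t.
        idx = bisect_right(starts, t) - 1
        if idx < 0 or t >= segments[idx][2]:
            # Before the first segment or inside a gap: treat as silence.
            labels.append("sil")
        else:
            labels.append(segments[idx][0])
    return labels


def frame_expressions(segments, pool=EXPRESSION_POOL, fps=30):
    """Look up one pool entry per frame; unknown labels fall back to silence."""
    return [pool.get(label, pool["sil"]) for label in frame_labels(segments, fps)]


if __name__ == "__main__":
    recognized = [("sil", 0.0, 0.1), ("m", 0.1, 0.2), ("a", 0.2, 0.4)]
    print(frame_expressions(recognized, fps=10))
```

A real system would likely blend between neighboring expressions rather than switch abruptly at segment boundaries; the per-frame lookup here is only the simplest way to show time-stamp-driven synchronization.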
