Humanoid Audio–Visual Avatar With Emotive Text-to-Speech Synthesis

Emotive audio-visual avatars are virtual computer agents with the potential to significantly improve the quality of human-machine interaction and human-human communication. However, our understanding of human communication has not yet advanced to the point where it is possible to build realistic avatars that interact with natural-sounding emotive speech and realistic-looking emotional facial expressions. In this paper, we propose a novel multimodal framework for a text-driven emotive audio-visual avatar and describe the technical approaches behind it. Our work focuses on emotive speech synthesis, realistic emotional facial expression animation, and the coarticulation between speech gestures (i.e., lip movements) and facial expressions. We design a general framework for emotive text-to-speech (TTS) synthesis using a diphone synthesizer and integrate it into a generic 3-D avatar face model. Guided by this framework, we developed a realistic 3-D avatar prototype. A rule-based emotive TTS module built on the Festival-MBROLA architecture demonstrates the effectiveness of the framework design. Subjective listening experiments were carried out to evaluate the expressiveness of the synthetic talking avatar.
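To make the rule-based Festival-MBROLA approach concrete, below is a minimal sketch of how per-emotion prosody rules can be applied to MBROLA .pho phoneme data, where each line carries a phoneme, its duration in milliseconds, and optional (position %, F0 Hz) pitch-target pairs. The rule values and the apply_emotion helper are illustrative assumptions, not the paper's actual rules.

# A minimal sketch of rule-based emotive prosody transformation on
# MBROLA .pho input. Rule values are hypothetical, for illustration only.

EMOTION_RULES = {
    # emotion: (pitch_scale, duration_scale) -- assumed values
    "happy": (1.20, 0.90),   # raise F0 targets, speak faster
    "sad":   (0.85, 1.25),   # lower F0 targets, speak slower
    "angry": (1.10, 0.85),
}

def apply_emotion(pho_lines, emotion):
    """Scale MBROLA phoneme durations and pitch targets per emotion.

    Each .pho line has the form: phoneme duration_ms [percent f0_hz ...]
    """
    pitch_k, dur_k = EMOTION_RULES[emotion]
    out = []
    for line in pho_lines:
        fields = line.split()
        if not fields or line.startswith(";"):   # keep comments/blanks
            out.append(line)
            continue
        phone, dur, targets = fields[0], float(fields[1]), fields[2:]
        dur = round(dur * dur_k)
        # Targets alternate (position %, F0 Hz); scale only the F0 values.
        scaled = [
            t if i % 2 == 0 else str(round(float(t) * pitch_k))
            for i, t in enumerate(targets)
        ]
        out.append(" ".join([phone, str(dur)] + scaled))
    return out

# Example: re-target a neutral fragment to sound "happy".
neutral = ["h 80 50 120", "@ 120 0 118 100 110", "_ 200"]
print("\n".join(apply_emotion(neutral, "happy")))

The transformed .pho stream can then be passed to an MBROLA voice for waveform generation, while the same emotion label drives the avatar's facial expression channel.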
