Humanoid Audio–Visual Avatar With Emotive Text-to-Speech Synthesis

Emotive audio-visual avatars are virtual computer agents with the potential to significantly improve the quality of human-machine interaction and human-human communication. However, our understanding of human communication has not yet advanced to the point where it is possible to build realistic avatars that interact with natural-sounding emotive speech and realistic-looking emotional facial expressions. In this paper, we propose a novel multimodal framework for a text-driven emotive audio-visual avatar and describe the technical approaches behind it. Our work focuses on emotive speech synthesis, realistic emotional facial expression animation, and the coarticulation between speech gestures (i.e., lip movements) and facial expressions. We design a general framework for emotive text-to-speech (TTS) synthesis using a diphone synthesizer and integrate it into a generic 3-D avatar face model. Guided by this framework, we developed a realistic 3-D avatar prototype. A rule-based emotive TTS module built on the Festival-MBROLA architecture demonstrates the effectiveness of the framework design. Subjective listening experiments were carried out to evaluate the expressiveness of the synthetic talking avatar.
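To make the rule-based Festival-MBROLA approach concrete, below is a minimal sketch of how per-emotion prosody rules can be applied to MBROLA .pho phoneme data, where each line carries a phoneme, its duration in milliseconds, and optional (position %, F0 Hz) pitch-target pairs. The rule values and the apply_emotion helper are illustrative assumptions, not the paper's actual rules.

# A minimal sketch of rule-based emotive prosody transformation on
# MBROLA .pho input. Rule values are hypothetical, for illustration only.

EMOTION_RULES = {
    # emotion: (pitch_scale, duration_scale) -- assumed values
    "happy": (1.20, 0.90),   # raise F0 targets, speak faster
    "sad":   (0.85, 1.25),   # lower F0 targets, speak slower
    "angry": (1.10, 0.85),
}

def apply_emotion(pho_lines, emotion):
    """Scale MBROLA phoneme durations and pitch targets per emotion.

    Each .pho line has the form: phoneme duration_ms [percent f0_hz ...]
    """
    pitch_k, dur_k = EMOTION_RULES[emotion]
    out = []
    for line in pho_lines:
        fields = line.split()
        if not fields or line.startswith(";"):   # keep comments/blanks
            out.append(line)
            continue
        phone, dur, targets = fields[0], float(fields[1]), fields[2:]
        dur = round(dur * dur_k)
        # Targets alternate (position %, F0 Hz); scale only the F0 values.
        scaled = [
            t if i % 2 == 0 else str(round(float(t) * pitch_k))
            for i, t in enumerate(targets)
        ]
        out.append(" ".join([phone, str(dur)] + scaled))
    return out

# Example: re-target a neutral fragment to sound "happy".
neutral = ["h 80 50 120", "@ 120 0 118 100 110", "_ 200"]
print("\n".join(apply_emotion(neutral, "happy")))

The transformed .pho stream can then be passed to an MBROLA voice for waveform generation, while the same emotion label drives the avatar's facial expression channel.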
