Paper Information - Text Driven 3D Photo-Realistic Talking Head

Text Driven 3D Photo-Realistic Talking Head

We propose a new 3D photo-realistic talking head with a personalized, photo-realistic appearance. Different head motions and facial expressions can be freely controlled and rendered. It extends our prior high-quality 2D photo-realistic talking head to 3D. About 20 minutes of audio-visual 2D video is first recorded of a speaker reading prompted sentences. We use a 2D-to-3D reconstruction algorithm to automatically adapt a general 3D head mesh model to the individual. In training, super feature vectors consisting of 3D geometry, texture, and speech are formed to train a statistical, multi-streamed Hidden Markov Model (HMM). The HMM is then used to synthesize both the trajectories of geometry animation and the dynamic texture. The 3D talking head animation can be controlled by the rendered geometric trajectory, while the facial expressions and articulator movements are rendered with the dynamic 2D image sequences. Head motions and facial expressions can also be controlled separately by manipulating the corresponding parameters. The new 3D talking head has many useful applications, such as voice agents, telepresence, gaming, and social networking.

Index Terms: audio/visual synthesis, 3D, photo-realistic, talking head

Frank K. Soong | Qiang Huo | Lijuan Wang | Wei Han

[1] Frank K. Soong, et al. A Sparse and Low-rank approach to efficient face alignment for photo-real talking head synthesis, 2011, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2] Zicheng Liu, et al. A Robust and Fast Face Modeling System, 2001, IEEE Pacific Rim Conference on Multimedia.

[3] Jörn Ostermann, et al. Realistic facial animation system for interactive services, 2008, INTERSPEECH.

[4] Frank K. Soong, et al. Synthesizing visual speech trajectory with minimum generation error, 2011, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5] Frank K. Soong, et al. A minimum converted trajectory error (MCTE) approach to high quality speech-to-lips conversion, 2010, INTERSPEECH.

[6] Harry Shum, et al. Automatic 3D face modeling from video, 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[7] Hans Peter Graf, et al. Photo-Realistic Talking-Heads from Image Samples, 2000, IEEE Trans. Multim.

[8] Yuxiao Hu, et al. Automatic 3D reconstruction for face recognition, 2004, Sixth IEEE International Conference on Automatic Face and Gesture Recognition.

[9] Frank K. Soong, et al. Synthesizing photo-real talking head via trajectory-guided sample selection, 2010, INTERSPEECH.

[10] Tomaso Poggio, et al. Trainable Videorealistic Speech Animation, 2004, FGR.

[11] Gérard Bailly, et al. LIPS2008: visual speech synthesis challenge, 2008, INTERSPEECH.