High quality lips animation with speech and captured facial action unit as A/V input

Rendering realistic lip movements on an avatar from camera-captured human facial features is desirable in many applications, e.g., telepresence, video gaming, and social networking. We have previously proposed using a Gaussian Mixture Model (GMM) to generate lip trajectories and successfully tested it in speech-to-lips conversion experiments, where only the audio signal (speech) is used as input. In this paper, the user's facial features, known as Action Units (AUs) and tracked in real time by the Microsoft Kinect SDK with a consumer-grade RGB camera, are combined with speech to form a joint A/V input for lip animation. We evaluate the lip-animation performance and show that the new combined A/V input reduces the conversion error by 22% in a speaker-dependent test, compared with a baseline system.
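To make the GMM-based conversion concrete, below is a minimal sketch of the kind of joint-space GMM mapping the abstract describes: a GMM is trained on stacked input/output vectors (speech features plus Kinect-tracked AUs as input, lip-shape parameters as output), and at run time the lip parameters are estimated as the minimum mean-square-error conditional expectation given the A/V input. All array shapes, feature dimensions, and the use of scikit-learn are illustrative assumptions, not the paper's actual implementation (which additionally uses trajectory-level generation with dynamic features).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical training data (shapes are illustrative only):
# per-frame joint A/V input features (speech spectra + Action Units)
# paired with lip-shape parameters.
X = np.random.randn(5000, 30)   # joint audio + AU features
Y = np.random.randn(5000, 12)   # lip trajectory parameters

# Fit one GMM on the joint [input, output] space.
gmm = GaussianMixture(n_components=16, covariance_type="full", random_state=0)
gmm.fit(np.hstack([X, Y]))

dx, dy = X.shape[1], Y.shape[1]

def convert(x):
    """MMSE estimate of lip parameters given a single A/V input frame x."""
    log_resp = np.zeros(gmm.n_components)
    cond_means = np.zeros((gmm.n_components, dy))
    for k in range(gmm.n_components):
        mu_x, mu_y = gmm.means_[k, :dx], gmm.means_[k, dx:]
        S_xx = gmm.covariances_[k][:dx, :dx]
        S_yx = gmm.covariances_[k][dx:, :dx]
        diff = x - mu_x
        sol = np.linalg.solve(S_xx, diff)
        _, logdet = np.linalg.slogdet(S_xx)
        # Unnormalized log responsibility of component k for the input part.
        log_resp[k] = np.log(gmm.weights_[k]) - 0.5 * (diff @ sol + logdet)
        # Conditional mean of the output part given the input part.
        cond_means[k] = mu_y + S_yx @ sol
    resp = np.exp(log_resp - log_resp.max())
    resp /= resp.sum()
    return resp @ cond_means    # weighted sum of per-component conditional means

lip_params = convert(X[0])
```

In this sketch the frame-wise estimate ignores temporal smoothness; the trajectory-based formulation referenced in the abstract imposes dynamic (delta) constraints over the whole utterance instead of converting each frame independently.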
