Realtime speech-driven facial animation using Gaussian Mixture Models

Synthesizing speech-driven facial animation means animating a virtual face according to an input audio signal; audio-to-visual conversion lies at its core. In this paper, Gaussian Mixture Models (GMMs) are employed for audio-to-visual conversion. The conventional GMM-based method performs the conversion frame by frame using minimum mean square error (MMSE) estimation. We address two issues with this method: the influence of previous visual features on the current visual feature is ignored, and GMM training and conversion are inconsistent. To resolve them, we incorporate previous visual features into the conversion and propose a minimum-conversion-error criterion to refine the GMM parameters. Experiments on a publicly available database show that our method accurately converts audio features into visual features, with accuracy comparable to a state-of-the-art trajectory-based approach. Based on the proposed method, we develop a speech-driven facial animation system that runs in real time and outputs realistic speech animations.
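The conventional frame-by-frame conversion described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a joint audio-visual GMM with per-component means and full covariances is already trained, and the hypothetical function `gmm_mmse_convert` computes the standard MMSE estimate, a posterior-weighted sum of the per-component conditional means of the visual features given the audio features.

```python
import numpy as np

def gmm_mmse_convert(x, weights, means, covs):
    """Frame-by-frame GMM-based audio-to-visual conversion via MMSE.

    Each mixture component m models the joint vector z = [x; y] with
    mean [mu_x; mu_y] and covariance [[Sxx, Sxy], [Syx, Syy]].  The
    MMSE estimate is
        y_hat = sum_m P(m | x) * (mu_y + Syx Sxx^{-1} (x - mu_x)).

    x       : audio feature vector for one frame, shape (dx,)
    weights : mixture weights, shape (M,)
    means   : joint means, shape (M, dx + dy)
    covs    : joint covariances, shape (M, dx + dy, dx + dy)
    """
    dx = len(x)
    M = len(weights)
    log_post = np.empty(M)                      # log P(m | x), unnormalized
    cond_means = np.empty((M, means.shape[1] - dx))
    for m in range(M):
        mu_x, mu_y = means[m, :dx], means[m, dx:]
        Sxx = covs[m, :dx, :dx]                 # audio marginal covariance
        Syx = covs[m, dx:, :dx]                 # visual-audio cross covariance
        diff = x - mu_x
        sol = np.linalg.solve(Sxx, diff)        # Sxx^{-1} (x - mu_x)
        _, logdet = np.linalg.slogdet(Sxx)
        # Log of weight_m * N(x; mu_x, Sxx), the audio marginal likelihood.
        log_post[m] = (np.log(weights[m])
                       - 0.5 * (logdet + diff @ sol + dx * np.log(2 * np.pi)))
        cond_means[m] = mu_y + Syx @ sol        # E[y | x, m]
    log_post -= log_post.max()                  # stabilize before exponentiating
    post = np.exp(log_post)
    post /= post.sum()                          # normalize to P(m | x)
    return post @ cond_means                    # posterior-weighted estimate
```

This frame-independent estimate is exactly what the paper extends: the proposed method augments the conditioning with previous visual features so that temporal continuity is taken into account.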
