A Study on 2D Photo-Realistic Facial Animation Generation Using 3D Facial Feature Points and Deep Neural Networks

This paper proposes a technique for generating 2D photo-realistic facial animation from input text. The technique is based on a deep neural network (DNN) mapping that uses 3D facial feature points. Our previous approach worked only in 2D space, using hidden Markov models (HMMs) and DNNs. That approach has the disadvantage that the generated 2D facial pixels are sensitive to rotation of the face in the training data. In this study, we alleviate the problem by using 3D facial feature points obtained with Kinect. The shape and color information of the face is parameterized by the 3D facial feature points. In model training, the relation between the labels derived from the text and the face-model parameters is modeled by DNNs. In a preliminary experiment, we show that the proposed technique can generate 2D facial animation from arbitrary input text.
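
To make the label-to-parameter mapping concrete, the following is a minimal sketch of a feedforward regression network trained to map frame-level label vectors to face-model parameter vectors. It is an illustration under assumptions, not the configuration used in the paper: the label dimension (10), parameter dimension (6), single tanh hidden layer, squared-error loss, learning rate, and the synthetic data are all hypothetical choices made only for this example.

```python
import numpy as np

def init_mlp(sizes, seed=0):
    """Create (weight, bias) pairs for a fully connected network."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, np.sqrt(1.0 / m), (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Return the activations of every layer; the output layer is linear."""
    acts = [x]
    for i, (W, b) in enumerate(params):
        z = acts[-1] @ W + b
        acts.append(z if i == len(params) - 1 else np.tanh(z))
    return acts

def train_step(params, x, y, lr=0.02):
    """One step of batch gradient descent on the squared error."""
    acts = forward(params, x)
    diff = acts[-1] - y
    loss = np.mean(np.sum(diff ** 2, axis=1))
    grad = 2.0 * diff / x.shape[0]           # d(loss) / d(output)
    updated = []
    for i in reversed(range(len(params))):
        W, b = params[i]
        gW = acts[i].T @ grad
        gb = grad.sum(axis=0)
        if i > 0:
            # Back-propagate through the tanh of the previous layer.
            grad = (grad @ W.T) * (1.0 - acts[i] ** 2)
        updated.append((W - lr * gW, b - lr * gb))
    return updated[::-1], loss

# Synthetic stand-ins (assumed): 64 frames of 10-dim text-derived labels
# and 6-dim face-model parameters generated by an arbitrary smooth function.
rng = np.random.default_rng(1)
labels = rng.normal(size=(64, 10))
face_params = np.tanh(labels @ (0.5 * rng.normal(size=(10, 6))))

params = init_mlp([10, 16, 6])
losses = []
for _ in range(300):
    params, loss = train_step(params, labels, face_params)
    losses.append(loss)

# Predicted face-model parameters, one row per frame.
pred = forward(params, labels)[-1]
```

In a pipeline like the one described, each predicted row would then be converted back into face shape and color to render one animation frame; that rendering step is outside the scope of this sketch.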
