DNN-Based Talking Movie Generation with Face Direction Consideration

In this paper, we propose a method for generating talking head animations that takes the direction of the face into account. The proposed method parametrizes a facial image using the active appearance model (AAM) and models the AAM parameters with a feedforward deep neural network (DNN). Since the AAM is a two-dimensional face model, conventional AAM-based methods assume a frontal face only. As a result, when the generated face is combined with other parts such as the head and body, the direction of the face and that of the head are often inconsistent. The proposed method models the shape parameters of the AAM using principal component analysis (PCA) so that the direction and the movement of individual facial parts are modeled separately; we then substitute the face direction of the generated animation with that of the head part so that the directions of the face and the head coincide. We conducted an experiment demonstrating that the proposed method can generate facial animation with the proper face direction.
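To make the pose-substitution step concrete, the sketch below shows one plausible realization in Python. It assumes the generated and head-part shape vectors are already available as arrays, and that the leading PCA components of the shape parameters mainly capture rigid face direction while the remaining components capture local movement such as the lips and eyes. The array names, the split point n_pose, and the use of scikit-learn's PCA are illustrative assumptions, not the paper's exact implementation.

```python
# A minimal sketch of substituting the generated face direction with the
# head part's direction in PCA coefficient space. Assumptions (not from
# the paper): the first n_pose components encode rigid head direction.
import numpy as np
from sklearn.decomposition import PCA

def substitute_face_direction(shapes, head_shapes, n_pose=3, n_components=20):
    """Replace the pose-related PCA coefficients of the generated shapes
    with those estimated from the head part.

    shapes:      (T, 2K) AAM shape vectors of the generated animation
    head_shapes: (T, 2K) shape vectors tracked from the head part
    """
    pca = PCA(n_components=n_components)
    pca.fit(shapes)

    gen_coef = pca.transform(shapes)        # (T, n_components)
    head_coef = pca.transform(head_shapes)  # (T, n_components)

    # Keep the generated local movement (trailing components),
    # take the head part's direction (leading components).
    mixed = gen_coef.copy()
    mixed[:, :n_pose] = head_coef[:, :n_pose]

    # Map back to shape vectors with a consistent face/head direction.
    return pca.inverse_transform(mixed)
```

Under these assumptions, the substitution is a simple coefficient swap: because PCA separates global rigid motion from local articulation in the leading and trailing components, replacing only the leading coefficients leaves the synthesized lip movement intact while aligning the face with the head's orientation.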
