Generative Adversarial Talking Head: Bringing Portraits to Life with a Weakly Supervised Neural Network

This paper presents Generative Adversarial Talking Head (GATH), a novel deep generative neural network that enables fully automatic facial expression synthesis for an arbitrary portrait, driven by continuous action unit (AU) coefficients. Specifically, our model directly manipulates image pixels so that the unseen subject in a still photo expresses various emotions controlled by the values of facial AU coefficients, while preserving the subject's personal characteristics, such as facial geometry, skin color and hair style, as well as the original surrounding background. In contrast to prior work, GATH is purely data-driven and requires neither a statistical face model nor image processing tricks to enact facial deformations. Additionally, our model is trained on unpaired data: the input image, which carries an auxiliary identity label and is drawn from the abundance of still photos in the wild, and the target frame come from different persons. To learn such a model effectively, we propose a novel weakly supervised adversarial learning framework that consists of a generator, a discriminator, a classifier and an action unit estimator. Our work gives rise to template-and-target-free expression editing, where still faces can be effortlessly animated with arbitrary AU coefficients provided by the user.
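To make the four-component framework more concrete, the sketch below shows one way a generator objective could combine feedback from a discriminator, an identity classifier and an AU estimator. The abstract does not give the actual loss functions, so the specific terms (binary cross-entropy, categorical cross-entropy, mean squared error), the weights `w_id` and `w_au`, and all function names are assumptions made for illustration only.

```python
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for a single probability p and label y."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def cross_entropy(probs, label, eps=1e-7):
    """Categorical cross-entropy for one sample with an integer label."""
    return -math.log(max(probs[label], eps))


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(pred, target)) / len(pred)


def generator_objective(d_fake, id_probs, id_label, au_est, au_target,
                        w_id=1.0, w_au=1.0):
    """Hypothetical generator loss (assumed form, not taken from the paper).

    d_fake    : discriminator's probability that the generated image is real
    id_probs  : classifier's identity distribution for the generated image
    id_label  : identity label of the input portrait
    au_est    : AU coefficients estimated from the generated image
    au_target : AU coefficients requested by the user
    """
    adversarial = bce(d_fake, 1.0)                  # look realistic
    identity = cross_entropy(id_probs, id_label)    # keep the same person
    expression = mse(au_est, au_target)             # match the requested AUs
    return adversarial + w_id * identity + w_au * expression
```

Under these assumptions, the loss is near zero when the discriminator is fooled, the identity is preserved and the estimated AU coefficients match the requested ones, and it grows as any of the three conditions is violated.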
