Geometry-Guided Dense Perspective Network for Speech-Driven Facial Animation

Realistic speech-driven 3D facial animation is a challenging problem due to the complex relationship between speech and facial motion. In this paper, we propose a deep architecture, the Geometry-guided Dense Perspective Network (GDPnet), to achieve speaker-independent, realistic 3D facial animation. The encoder is designed with dense connections to strengthen feature propagation and encourage the reuse of audio features, while the decoder is integrated with an attention mechanism that adaptively recalibrates point-wise feature responses by explicitly modeling interdependencies between different neuron units. We also introduce a non-linear face reconstruction representation to guide the latent space toward more accurate deformation, which helps resolve geometry-related deformation and improves generalization across subjects. Huber and HSIC (Hilbert-Schmidt Independence Criterion) constraints are adopted to promote the robustness of our model and to better exploit non-linear and high-order correlations. Experimental results on a public dataset and a real scanned dataset validate the superiority of the proposed GDPnet over state-of-the-art models.
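The two training constraints named above are standard and can be sketched concretely. The following is a minimal illustration, not the paper's implementation: the Huber loss (quadratic for small residuals, linear for large ones, for robustness to outliers) and the biased empirical HSIC estimator tr(KHLH)/(n-1)^2, which measures non-linear statistical dependence between two sets of features via kernel matrices. Function names, the RBF kernel choice, and the `delta`/`sigma` defaults are assumptions for illustration.

```python
import numpy as np

def huber(residuals, delta=1.0):
    # Huber loss: 0.5*r^2 when |r| <= delta, else delta*(|r| - 0.5*delta).
    # delta is a hypothetical default; the paper does not specify its value.
    r = np.abs(residuals)
    quadratic = 0.5 * r ** 2
    linear = delta * (r - 0.5 * delta)
    return np.where(r <= delta, quadratic, linear).sum()

def rbf_kernel(x, sigma=1.0):
    # Pairwise Gaussian (RBF) kernel matrix over the rows of x.
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    # Biased empirical HSIC estimator: tr(K H L H) / (n - 1)^2,
    # where H = I - (1/n) 11^T is the centering matrix.
    n = x.shape[0]
    K = rbf_kernel(x, sigma)
    L = rbf_kernel(y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```

Because H is idempotent, tr(KHLH) = tr((HKH)(HLH)) is a trace of a product of two positive semi-definite matrices, so the estimate is always non-negative; larger values indicate stronger (possibly non-linear) dependence between the two feature sets.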
