Neural Face Models for Example-Based Visual Speech Synthesis

Creating realistic animations of human faces with computer graphics models remains a challenging task. It is typically solved either with tedious manual work or with motion-capture-based techniques that require specialised and costly hardware. Example-based animation approaches circumvent these problems by reusing captured data of real people. The data is split into short motion samples that can be looped or concatenated to create novel motion sequences. The obvious advantages of this approach are its simplicity of use and high realism, since the data exhibits only real deformations. Rather than tuning the weights of a complex face rig, the animation task is performed on a higher level, by arranging typical motion samples so that the desired facial performance is achieved. Two difficulties with example-based approaches, however, are their high memory requirements and the creation of artefact-free, realistic transitions between motion samples. We solve these problems by combining the realism and simplicity of example-based animations with the advantages of neural face models. Our neural face model synthesises high-quality 3D face geometry and texture from a compact latent parameter vector. This latent representation reduces memory requirements by a factor of 100 and helps create seamless transitions between concatenated motion samples. In this paper, we present a markerless approach for facial motion capture based on multi-view video. From the captured data, we learn a neural representation of facial expressions, which is used to seamlessly concatenate facial performances during the animation procedure. We demonstrate the effectiveness of our approach by synthesising mouthings for Swiss-German sign language from viseme query sequences.
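To make the role of the latent representation concrete, the sketch below shows how a decoder of this kind can map a compact latent code to face geometry and texture, and how linearly blending latent codes over a short overlap can smooth the transition between two concatenated motion samples. This is a minimal illustration, not the authors' implementation: the network architecture, latent dimensionality, mesh and texture resolutions, and the linear blending rule are all assumptions made for the example.

```python
# Minimal sketch (assumed architecture, not the paper's code): a decoder
# maps a compact latent vector to 3D face geometry and texture; transitions
# between motion samples are created by interpolating in latent space.
import torch
import torch.nn as nn

LATENT_DIM = 128      # assumed size of the compact latent parameter vector
NUM_VERTICES = 5000   # assumed mesh resolution
TEX_RES = 64          # assumed (low) texture resolution for this sketch

class FaceDecoder(nn.Module):
    """Decodes a latent expression code into per-vertex geometry and a texture map."""
    def __init__(self):
        super().__init__()
        self.geometry_head = nn.Sequential(
            nn.Linear(LATENT_DIM, 512), nn.ReLU(),
            nn.Linear(512, NUM_VERTICES * 3),       # xyz per vertex
        )
        self.texture_head = nn.Sequential(
            nn.Linear(LATENT_DIM, 512), nn.ReLU(),
            nn.Linear(512, TEX_RES * TEX_RES * 3),  # RGB texture map
        )

    def forward(self, z):
        geometry = self.geometry_head(z).view(-1, NUM_VERTICES, 3)
        texture = self.texture_head(z).view(-1, 3, TEX_RES, TEX_RES)
        return geometry, texture

def blend_transition(z_end, z_start, steps=10):
    """Linearly interpolate latent codes over a short overlap so the end of
    one motion sample flows into the start of the next."""
    alphas = torch.linspace(0.0, 1.0, steps).unsqueeze(1)
    return (1 - alphas) * z_end + alphas * z_start

# Usage: decode a blended transition between two stored motion samples.
decoder = FaceDecoder()
z_a, z_b = torch.randn(LATENT_DIM), torch.randn(LATENT_DIM)
for z in blend_transition(z_a, z_b):
    geometry, texture = decoder(z.unsqueeze(0))
```

Because transitions are computed in the latent space rather than on raw meshes and textures, only the short latent codes of each motion sample need to be stored, which is where the memory saving described above comes from.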
