LipSync3D: Data-Efficient Learning of Personalized 3D Talking Faces from Video using Pose and Lighting Normalization

In this paper, we present a video-based learning framework for animating personalized 3D talking faces from audio. We introduce two training-time data normalizations that significantly improve data sample efficiency. First, we isolate and represent faces in a normalized space that decouples 3D geometry, head pose, and texture. This decomposes the prediction problem into regressions over the 3D face shape and the corresponding 2D texture atlas. Second, we leverage facial symmetry and approximate albedo constancy of skin to isolate and remove spatio-temporal lighting variations. Together, these normalizations allow simple networks to generate high fidelity lip-sync videos under novel ambient illumination while training with just a single speaker-specific video. Further, to stabilize temporal dynamics, we introduce an auto-regressive approach that conditions the model on its previous visual state. Human ratings and objective metrics demonstrate that our method outperforms contemporary state-of-the-art audio-driven video reenactment methods in terms of realism, lip-sync, and visual quality. We illustrate several applications enabled by our framework.
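
To make the pose-normalization idea concrete, the sketch below aligns each frame's tracked 3D face vertices to a canonical reference mesh with a least-squares similarity transform in the style of Umeyama (1991), a standard choice for this kind of rigid/similarity alignment. This is an illustrative reconstruction, not the paper's code: the function names, the NumPy implementation, and the assumption that the tracker provides per-frame vertices in fixed correspondence with the reference mesh are all ours.

```python
import numpy as np

def umeyama_similarity(src, dst):
    """Least-squares similarity transform (s, R, t) mapping src -> dst.

    src, dst: (N, 3) arrays of corresponding 3D points.
    Follows Umeyama, IEEE TPAMI 1991.
    """
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # Cross-covariance between the centered point sets.
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)

    # Guard against a reflection solution (det(R) must be +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[-1, -1] = -1

    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t

def normalize_pose(frame_vertices, reference_vertices):
    """Map one frame's face mesh into the canonical, pose-free space."""
    s, R, t = umeyama_similarity(frame_vertices, reference_vertices)
    return s * frame_vertices @ R.T + t
```

With head pose factored out this way, the networks described in the abstract only have to regress the residual 3D face shape and the 2D texture atlas in the canonical space, which is what makes training from a single speaker-specific video tractable.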
