Neural Voice Puppetry: Audio-driven Facial Reenactment

We present Neural Voice Puppetry, a novel approach for audio-driven facial video synthesis. Given an audio sequence from a source person or digital assistant, we generate a photo-realistic video of a target person whose facial motion is in sync with the source audio. This audio-driven facial reenactment is performed by a deep neural network that operates in a latent 3D face model space. The underlying 3D representation lets the model learn temporally stable motion, while neural rendering produces photo-realistic output frames. Our approach generalizes across different people, allowing us to synthesize videos of a target actor driven by the voice of any unknown source actor, or even by synthetic voices generated with standard text-to-speech methods. Neural Voice Puppetry supports a variety of use cases, including audio-driven video avatars, video dubbing, and text-driven video synthesis of a talking head. We demonstrate the capabilities of our method on a series of audio- and text-based puppetry examples, including comparisons to state-of-the-art techniques and a user study.
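For intuition, the pipeline described above can be viewed as two learned stages: an audio-to-expression network that maps a short window of speech features to coefficients of a latent 3D face expression space, followed by a neural renderer that turns a rasterized feature image of the posed 3D model into a photo-realistic frame. The sketch below is a minimal illustration of this idea, not the authors' released implementation; the module names, layer sizes, the 29-dimensional speech-recognition-style audio features, and the 76-dimensional expression space are all assumptions made for the example.

```python
import torch
import torch.nn as nn

class AudioToExpression(nn.Module):
    """Maps a window of per-frame audio features (e.g., speech-recognition
    logits) to coefficients of a latent 3D face expression space.
    All dimensions here are illustrative assumptions."""
    def __init__(self, audio_dim=29, window=16, n_expr=76):
        super().__init__()
        # Temporal convolutions over the audio window encourage a smooth,
        # temporally stable expression signal.
        self.temporal = nn.Sequential(
            nn.Conv1d(audio_dim, 64, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv1d(64, 64, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
        )
        self.to_expr = nn.Linear(64 * window, n_expr)

    def forward(self, audio_feats):  # (B, window, audio_dim)
        x = self.temporal(audio_feats.transpose(1, 2))  # (B, 64, window)
        return self.to_expr(x.flatten(1))               # (B, n_expr)

class NeuralRenderer(nn.Module):
    """Stand-in for the neural rendering stage: decodes a rasterized
    feature image of the driven 3D face model into an RGB frame."""
    def __init__(self, feat_ch=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_ch, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, kernel_size=3, padding=1), nn.Tanh(),
        )

    def forward(self, feat_img):  # (B, feat_ch, H, W)
        return self.net(feat_img)

# Illustrative usage: 16 frames of 29-dim audio features -> 76 expression
# coefficients, which would drive the target's 3D face model before rendering.
a2e = AudioToExpression()
expr = a2e(torch.randn(2, 16, 29))  # shape: (2, 76)
```

Because the expression coefficients live in a shared 3D face model space rather than in image space, the audio-to-expression stage can generalize across speakers, while the renderer remains specific to the target actor.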
