ReliTalk: Relightable Talking Portrait Generation from a Single Video

Recent years have witnessed great progress in creating vivid audio-driven portraits from monocular videos. However, how to seamlessly adapt the created video avatars to other scenarios with different backgrounds and lighting conditions remains unsolved. On the other hand, existing relighting studies mostly rely on dynamically lit or multi-view data, which are too expensive to collect for creating video portraits. To bridge this gap, we propose ReliTalk, a novel framework for relightable audio-driven talking portrait generation from monocular videos. Our key insight is to decompose the portrait's reflectance from implicitly learned audio-driven facial normals and images. Specifically, we involve 3D facial priors derived from audio features to predict delicate normal maps through implicit functions. These initially predicted normals then play a crucial role in reflectance decomposition by dynamically estimating the lighting condition of the given video. Moreover, the stereoscopic face representation is refined using an identity-consistent loss under simulated multiple lighting conditions, addressing the ill-posed problem caused by the limited views available from a single monocular video. Extensive experiments validate the superiority of our proposed framework on both real and synthetic datasets. Our code is released at https://github.com/arthur-qiu/ReliTalk.
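To make the relighting step concrete: once normals and albedo have been decomposed, a portrait can be re-shaded under a new environment described by second-order spherical harmonics, following the classic Lambertian irradiance model of Ramamoorthi & Hanrahan (2001). The sketch below is purely illustrative and is not the paper's implementation; the function names and the plain Lambertian shading (no specular term or shadows) are our own simplifying assumptions.

```python
import numpy as np

def sh_basis(normals):
    """Second-order (9-term) spherical harmonics basis evaluated at unit
    surface normals of shape (..., 3); constants follow Ramamoorthi &
    Hanrahan (2001)."""
    x, y, z = normals[..., 0], normals[..., 1], normals[..., 2]
    return np.stack([
        np.full_like(x, 0.282095),        # Y_00 (constant / ambient term)
        0.488603 * y,                     # Y_1,-1
        0.488603 * z,                     # Y_1,0
        0.488603 * x,                     # Y_1,1
        1.092548 * x * y,                 # Y_2,-2
        1.092548 * y * z,                 # Y_2,-1
        0.315392 * (3.0 * z * z - 1.0),   # Y_2,0
        1.092548 * x * z,                 # Y_2,1
        0.546274 * (x * x - y * y),       # Y_2,2
    ], axis=-1)

def relight(albedo, normals, sh_coeffs):
    """Lambertian image formation: I = albedo * (SH basis . lighting coeffs).
    albedo: (H, W, 3), normals: (H, W, 3) unit vectors, sh_coeffs: (9,)."""
    shading = sh_basis(normals) @ sh_coeffs            # (H, W)
    shading = np.clip(shading, 0.0, None)              # no negative light
    return albedo * shading[..., None]                  # broadcast over RGB
```

Swapping in a different `sh_coeffs` vector relights the same decomposed portrait under a new environment, which is the property that makes the decomposition useful for background replacement.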
