StyleSync: High-Fidelity Generalized and Personalized Lip Sync in Style-Based Generator

Despite recent advances in syncing lip movements with arbitrary audio, current methods still struggle to balance generation quality with generalization ability. Previous studies either require long-duration data of the target speaker for training or produce similar, low-quality movement patterns across all subjects. In this paper, we propose StyleSync, an effective framework for high-fidelity lip synchronization. We show that a style-based generator can support this property in both one-shot and few-shot scenarios. Specifically, we design a mask-guided spatial information encoding module that preserves the details of the given face, while mouth shapes are accurately driven by the audio through modulated convolutions. Moreover, our design enables personalized lip sync by refining the style space and the generator on only a limited number of frames, so the identity and talking style of a target person are accurately preserved. Extensive experiments demonstrate the effectiveness of our method in producing high-fidelity results across a variety of scenes. Resources can be found at https://hangz-nju-cuhk.github.io/projects/StyleSync.
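To make the audio-driven modulated convolution idea above concrete, here is a minimal PyTorch sketch under our own assumptions: the module name, dimensions, and the StyleGAN2-style demodulation are illustrative and are not taken from the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioModulatedConv2d(nn.Module):
    """Sketch: StyleGAN2-style modulated convolution whose per-channel
    style vector is predicted from an audio embedding (hypothetical names)."""

    def __init__(self, in_ch, out_ch, audio_dim, kernel_size=3, demodulate=True):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, kernel_size, kernel_size))
        self.to_style = nn.Linear(audio_dim, in_ch)  # audio feature -> per-channel scale
        self.demodulate = demodulate
        self.padding = kernel_size // 2

    def forward(self, x, audio_feat):
        b, c, h, w = x.shape
        style = self.to_style(audio_feat).view(b, 1, c, 1, 1)   # (B, 1, Cin, 1, 1)
        weight = self.weight.unsqueeze(0) * style                # modulate per sample
        if self.demodulate:
            norm = torch.rsqrt((weight ** 2).sum(dim=(2, 3, 4)) + 1e-8)
            weight = weight * norm.view(b, -1, 1, 1, 1)
        # Fold the batch into the group dimension so each sample uses its own kernels.
        weight = weight.view(-1, c, *self.weight.shape[2:])
        out = F.conv2d(x.view(1, b * c, h, w), weight, padding=self.padding, groups=b)
        return out.view(b, -1, h, w)

# Usage sketch: spatial features of the masked face, modulated by an audio-window embedding.
feat = torch.randn(2, 64, 32, 32)
audio = torch.randn(2, 256)
out = AudioModulatedConv2d(64, 128, 256)(feat, audio)  # -> (2, 128, 32, 32)
```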

[1]  Tangjie Lv,et al.  DINet: Deformation Inpainting Network for Realistic Face Visually Dubbing on High Resolution Video , 2023, AAAI.

[2]  Zhenhui Ye,et al.  GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis , 2023, ICLR.

[3]  Xin Yu,et al.  StyleTalk: One-shot Talking Head Generation with Controllable Speaking Styles , 2023, ArXiv.

[4]  Daniel Cohen-Or,et al.  Pivotal Tuning for Latent-based Editing of Real Images , 2021, ACM Trans. Graph..

[5]  Errui Ding,et al.  Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers , 2022, SIGGRAPH Asia.

[6]  Gang Zeng,et al.  Real-time Neural Radiance Talking Portrait Synthesis via Audio-spatial Decomposition , 2022, ArXiv.

[7]  Ming-Yu Liu,et al.  SPACE: Speech-driven Portrait Animation with Controllable Expression , 2022, 2023 IEEE/CVF International Conference on Computer Vision (ICCV).

[8]  Errui Ding,et al.  StyleSwap: Style-Based Generator Empowers Robust Face Swapping , 2022, ECCV.

[9]  Jiwen Lu,et al.  Learning Dynamic Facial Radiance Fields for Few-Shot Talking Head Synthesis , 2022, ECCV.

[10]  Se Jin Park,et al.  SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory , 2022, AAAI.

[11]  Xiaoguang Han,et al.  Expressive Talking Head Generation with Granular Audio-Visual Control , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Wayne Wu,et al.  EAMM: One-Shot Emotional Talking Face via Audio-Based Emotion-Aware Motion Model , 2022, SIGGRAPH.

[13]  Zili Yi,et al.  Region-Aware Face Swapping , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Yujiu Yang,et al.  StyleHEAT: One-Shot High-Resolution Editable Talking Face Generation via Pre-trained StyleGAN , 2022, ECCV.

[15]  Bolei Zhou,et al.  Semantic-Aware Implicit Neural Audio-Driven Video Portrait Generation , 2022, ECCV.

[16]  T. Komura,et al.  FaceFormer: Speech-Driven 3D Facial Animation with Transformers , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Xin Yu,et al.  One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning , 2021, AAAI.

[18]  Amit H. Bermano,et al.  HyperStyle: StyleGAN Inversion with HyperNetworks for Real Image Editing , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Xiaoou Tang,et al.  InterFaceGAN: Interpreting the Disentangled Face Representation Learned by GANs , 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20]  Chen Change Loy,et al.  Everybody’s Talkin’: Let Me Talk as You Want , 2020, IEEE Transactions on Information Forensics and Security.

[21]  Haozhe Wu,et al.  Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis , 2021, ACM Multimedia.

[22]  Jinxiang Chai,et al.  Live speech portraits , 2021, ACM Trans. Graph..

[23]  Madhukar Budagavi,et al.  FACIAL: Synthesizing Dynamic Talking Face with Implicit Attribute Learning , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[24]  Hideki Koike,et al.  Speech2Talking-Face: Inferring and Driving a Face with Synchronized Audio-Visual Representation , 2021, IJCAI.

[25]  Changjie Fan,et al.  Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion , 2021, IJCAI.

[26]  Jaakko Lehtinen,et al.  Alias-Free Generative Adversarial Networks , 2021, NeurIPS.

[27]  Vivek Kwatra,et al.  LipSync3D: Data-Efficient Learning of Personalized 3D Talking Faces from Video using Pose and Lighting Normalization , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Peiran Ren,et al.  GAN Prior Embedded Network for Blind Face Restoration in the Wild , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Chen Change Loy,et al.  Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Yaser Sheikh,et al.  MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[31]  Xun Cao,et al.  Audio-Driven Emotional Video Portraits , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Daniel Cohen-Or,et al.  ReStyle: A Residual-Based StyleGAN Encoder via Iterative Refinement , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[33]  H. Bao,et al.  AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[34]  Daniel Cohen-Or,et al.  Designing an encoder for StyleGAN image manipulation , 2021, ACM Trans. Graph..

[35]  Xintao Wang,et al.  Towards Real-World Blind Face Restoration with Generative Facial Prior , 2021, Computer Vision and Pattern Recognition.

[36]  Daniel Cohen-Or,et al.  Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Bolei Zhou,et al.  Closed-Form Factorization of Latent Semantics in GANs , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  C. V. Jawahar,et al.  A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild , 2020, ACM Multimedia.

[39]  Chenliang Xu,et al.  Talking-head Generation with Rhythmic Head Motion , 2020, ECCV.

[40]  Haitian Zheng,et al.  What comprises a good talking-head video generation?: A Survey and Benchmark , 2020, ArXiv.

[41]  Yang Zhou,et al.  MakeltTalk , 2020, ACM Trans. Graph..

[42]  Victor Lempitsky,et al.  Neural Head Reenactment with Latent Pose Descriptors , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Pratul P. Srinivasan,et al.  NeRF , 2020, ECCV.

[44]  Hujun Bao,et al.  Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose , 2020, 2002.10137.

[45]  Justus Thies,et al.  Neural Voice Puppetry: Audio-driven Facial Reenactment , 2019, ECCV.

[46]  Tero Karras,et al.  Analyzing and Improving the Image Quality of StyleGAN , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Peter Wonka,et al.  Image2StyleGAN++: How to Edit the Embedded Images? , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Yu Qiao,et al.  MEAD: A Large-Scale Audio-Visual Dataset for Emotional Talking-Face Generation , 2020, ECCV.

[49]  Joon Son Chung,et al.  You Said That?: Synthesising Talking Faces from Audio , 2019, International Journal of Computer Vision.

[50]  V. Lempitsky,et al.  Few-Shot Adversarial Learning of Realistic Neural Talking Head Models , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[51]  Chenliang Xu,et al.  Hierarchical Cross-Modal Talking Face Generation With Dynamic Pixel-Wise Loss , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Peter Wonka,et al.  Image2StyleGAN: How to Embed Images Into the StyleGAN Latent Space? , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[53]  Jiaolong Yang,et al.  Accurate 3D Face Reconstruction With Weakly-Supervised Learning: From Single Image to Image Set , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[54]  Timo Aila,et al.  A Style-Based Generator Architecture for Generative Adversarial Networks , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Hang Zhou,et al.  Talking Face Generation by Adversarially Disentangled Audio-Visual Representation , 2018, AAAI.

[56]  Jingwen Zhu,et al.  Talking Face Generation by Conditional Recurrent Adversarial Network , 2018, IJCAI.

[57]  Joon Son Chung,et al.  VoxCeleb2: Deep Speaker Recognition , 2018, INTERSPEECH.

[58]  Subhransu Maji,et al.  Visemenet , 2018, ACM Trans. Graph..

[59]  Chenliang Xu,et al.  Lip Movements Generation at a Glance , 2018, ECCV.

[60]  Jan Kautz,et al.  High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[61]  Jaakko Lehtinen,et al.  Audio-driven facial animation by joint end-to-end learning of pose and emotion , 2017, ACM Trans. Graph..

[62]  Joon Son Chung,et al.  VoxCeleb: A Large-Scale Speaker Identification Dataset , 2017, INTERSPEECH.

[63]  Joon Son Chung,et al.  Out of Time: Automated Lip Sync in the Wild , 2016, ACCV Workshops.

[64]  Joon Son Chung,et al.  Lip Reading in the Wild , 2016, ACCV.

[65]  Lei Xie,et al.  Photo-real talking head with deep bidirectional LSTM , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[66]  Eero P. Simoncelli,et al.  Image quality assessment: from error visibility to structural similarity , 2004, IEEE Transactions on Image Processing.