CPNet: Exploiting CLIP-based Attention Condenser and Probability Map Guidance for High-fidelity Talking Face Generation

Recently, talking face generation has drawn ever-increasing attention from the computer vision research community owing to its challenging nature and widespread applications, e.g., movie animation and virtual anchors. Although considerable effort has been devoted to improving the fidelity and lip-sync quality of generated talking face videos, there remains substantial room for improvement in both synthesis quality and efficiency. In particular, existing approaches largely overlook fine-grained feature extraction and integration, as well as the consistency between probability distributions of facial landmarks, which leads to blurred local details and degraded fidelity. To mitigate these issues, this paper presents a novel CLIP-based Attention and Probability Map Guided Network (CPNet) for synthesizing high-fidelity talking face videos. Specifically, to meet the demand for fine-grained feature recalibration, a CLIP-based attention condenser is employed to transfer knowledge with rich semantic priors from the pretrained CLIP model. Moreover, to guarantee consistency in probability space and suppress landmark ambiguity, we propose the density map of facial landmarks as an auxiliary supervisory signal that guides the learning of the landmark distribution of each generated frame. Extensive experiments on a widely used benchmark dataset demonstrate the superiority of our CPNet over state-of-the-art methods in terms of image and lip-sync quality. In addition, ablation studies are conducted to assess the impact of each pivotal component.
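The abstract does not spell out how the CLIP-based attention condenser is wired into the generator. The sketch below is a minimal, hypothetical reading of that idea: a frozen CLIP image encoder supplies a semantic prior, and a squeeze-and-excitation-style gate conditioned on that prior recalibrates the generator's feature channels. Module names, the SE-style design, and all hyper-parameters are assumptions, not the authors' exact architecture.

```python
# Hypothetical sketch of a CLIP-guided attention condenser (PyTorch).
# A frozen CLIP image encoder provides a semantic prior; an SE-style gate
# uses it to recalibrate generator feature channels.
import torch
import torch.nn as nn
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git


class CLIPAttentionCondenser(nn.Module):
    def __init__(self, feat_channels: int, clip_dim: int = 512, reduction: int = 8):
        super().__init__()
        # Channel-attention gate conditioned on the CLIP embedding.
        self.gate = nn.Sequential(
            nn.Linear(feat_channels + clip_dim, feat_channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(feat_channels // reduction, feat_channels),
            nn.Sigmoid(),
        )

    def forward(self, feat: torch.Tensor, clip_emb: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) generator features; clip_emb: (B, clip_dim) semantic prior.
        squeezed = feat.mean(dim=(2, 3))                   # global average pooling
        weights = self.gate(torch.cat([squeezed, clip_emb], dim=1))
        return feat * weights.unsqueeze(-1).unsqueeze(-1)  # channel recalibration


# Usage: extract the semantic prior with a frozen CLIP image encoder.
device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _preprocess = clip.load("ViT-B/32", device=device)
clip_model.eval()

condenser = CLIPAttentionCondenser(feat_channels=256).to(device)
frames = torch.randn(2, 3, 224, 224, device=device)       # reference face frames
feats = torch.randn(2, 256, 56, 56, device=device)        # intermediate generator features
with torch.no_grad():
    prior = clip_model.encode_image(frames).float()       # (2, 512) semantic prior
recalibrated = condenser(feats, prior)
```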
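Likewise, the landmark density map supervision is only named, not defined, in the abstract. A common way to realize such a signal is to render each landmark as a 2D Gaussian and penalize the discrepancy between the maps of the generated and ground-truth frames; the sketch below assumes that formulation, with the kernel width sigma and the L2 penalty chosen for illustration rather than taken from the paper.

```python
# Hypothetical sketch of a facial-landmark density map used as an auxiliary loss.
import torch
import torch.nn.functional as F


def landmark_density_map(landmarks: torch.Tensor, size: int = 64,
                         sigma: float = 2.0) -> torch.Tensor:
    """landmarks: (B, K, 2) pixel coordinates in [0, size); returns (B, 1, size, size)."""
    b, k, _ = landmarks.shape
    ys = torch.arange(size, device=landmarks.device).view(1, 1, size, 1)
    xs = torch.arange(size, device=landmarks.device).view(1, 1, 1, size)
    lx = landmarks[..., 0].view(b, k, 1, 1)
    ly = landmarks[..., 1].view(b, k, 1, 1)
    dist2 = (xs - lx) ** 2 + (ys - ly) ** 2                # squared distance to each landmark
    gauss = torch.exp(-dist2 / (2 * sigma ** 2))           # (B, K, size, size) Gaussian blobs
    density = gauss.sum(dim=1, keepdim=True)               # accumulate over landmarks
    return density / density.amax(dim=(2, 3), keepdim=True).clamp_min(1e-8)


# Auxiliary loss: align landmark distributions of generated and ground-truth frames.
pred_lm = torch.rand(2, 68, 2) * 64                        # landmarks detected on generated frames
gt_lm = torch.rand(2, 68, 2) * 64                          # landmarks detected on ground truth
loss_density = F.mse_loss(landmark_density_map(pred_lm), landmark_density_map(gt_lm))
```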
