Laughing Matters: Introducing Laughing-Face Generation using Diffusion Models

Speech-driven animation has gained significant traction in recent years, with current methods achieving near-photorealistic results. However, the field remains underexplored with regard to non-verbal communication, despite evidence demonstrating its importance in human interaction. Generating laughter sequences, in particular, presents a unique challenge due to the intricacy and nuance of this behaviour. This paper aims to bridge that gap by proposing a novel model capable of generating realistic laughter sequences given a still portrait and an audio clip containing laughter. We highlight the failure cases of traditional facial animation methods and leverage recent advances in diffusion models to produce convincing laughter videos. We train our model on a diverse set of laughter datasets and introduce an evaluation metric specifically designed for laughter. Compared with previous speech-driven approaches, our model achieves state-of-the-art performance across all metrics, even when those approaches are retrained for laughter generation.
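To make the described setup concrete, the following is a minimal sketch of how a diffusion model can be conditioned on a still portrait and a laughter-audio embedding, using plain DDPM ancestral sampling (Ho et al., 2020). The `LaughterDenoiser` module, its FiLM-style conditioning, and all shapes are hypothetical placeholders; the abstract does not specify the paper's actual architecture, so this illustrates only the generic technique.

```python
import torch
import torch.nn as nn

# Hypothetical denoiser: stands in for any network that predicts the noise in
# a video frame given the timestep, a still identity portrait, and an
# audio-derived laughter embedding. The real model would be far larger.
class LaughterDenoiser(nn.Module):
    def __init__(self, audio_dim=512):
        super().__init__()
        # 3 channels for the noisy frame + 3 for the identity portrait,
        # concatenated along the channel axis; audio and timestep enter
        # as FiLM-style per-channel biases for simplicity.
        self.backbone = nn.Conv2d(6, 3, kernel_size=3, padding=1)
        self.audio_proj = nn.Linear(audio_dim, 3)
        self.time_proj = nn.Linear(1, 3)

    def forward(self, x_t, t, portrait, audio_emb):
        h = self.backbone(torch.cat([x_t, portrait], dim=1))
        h = h + self.audio_proj(audio_emb)[:, :, None, None]
        h = h + self.time_proj(t[:, None].float())[:, :, None, None]
        return h  # predicted noise, same shape as x_t


@torch.no_grad()
def sample_frame(model, portrait, audio_emb, steps=1000):
    """Plain DDPM ancestral sampling (Ho et al., 2020) for one frame."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    x = torch.randn_like(portrait)  # start from pure Gaussian noise
    for i in reversed(range(steps)):
        t = torch.full((x.shape[0],), i, dtype=torch.long)
        eps = model(x, t, portrait, audio_emb)
        # Posterior mean of x_{t-1} given the predicted noise.
        coef = betas[i] / torch.sqrt(1.0 - alpha_bar[i])
        x = (x - coef * eps) / torch.sqrt(alphas[i])
        if i > 0:
            x = x + torch.sqrt(betas[i]) * torch.randn_like(x)
    return x


# Example usage with random tensors (shapes only; real inputs would come
# from a face crop and a pretrained audio encoder):
model = LaughterDenoiser()
portrait = torch.randn(1, 3, 64, 64)
audio_emb = torch.randn(1, 512)
frame = sample_frame(model, portrait, audio_emb, steps=50)
```

In practice, such a reverse process would run per frame or over a short frame volume, with the audio embedding advanced along the laughter clip and the portrait held fixed to preserve identity; conditioning variants such as classifier-free guidance on the audio signal are common in this family of models.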
