What comprises a good talking-head video generation?: A Survey and Benchmark

Over the years, performance evaluation has become essential in computer vision, enabling tangible progress in many sub-fields. While talking-head video generation has become an emerging research topic, existing evaluations of it suffer from many limitations. For example, most approaches rely on human subjects (e.g., via Amazon MTurk) to directly evaluate their research claims. This subjective evaluation is cumbersome, unreproducible, and may impede the development of new research. In this work, we present a carefully designed benchmark for evaluating talking-head video generation with standardized dataset pre-processing strategies. For evaluation, we either propose new metrics or select the most appropriate existing ones to measure what we consider the desired properties of a good talking-head video, namely identity preservation, lip synchronization, high video quality, and natural, spontaneous motion. By conducting a thorough analysis across several state-of-the-art talking-head generation approaches, we aim to uncover the merits and drawbacks of current methods and point out promising directions for future work. All the evaluation code is available at: this https URL.
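
To make one of these properties concrete, below is a minimal sketch of how an identity-preservation score could be computed: the average cosine similarity between the face embedding of the reference image and the embeddings of the generated frames. This is an illustrative assumption rather than the benchmark's exact metric; embed_face stands in for any pretrained face-recognition embedder (e.g., an ArcFace-style model) and is a hypothetical helper here.

    import numpy as np

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        # Cosine similarity between two 1-D embedding vectors.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def identity_preservation_score(reference_image, generated_frames, embed_face) -> float:
        # embed_face: hypothetical callable mapping an image to a 1-D embedding.
        # Returns the mean similarity between the reference identity and each
        # generated frame; values near 1.0 mean the identity is well preserved.
        ref = embed_face(reference_image)
        sims = [cosine_similarity(ref, embed_face(frame)) for frame in generated_frames]
        return float(np.mean(sims))

Under this sketch, higher scores indicate better identity preservation. The same pattern, a pretrained perceptual model plus a simple distance, can likewise serve the other criteria, for example scoring lip synchronization with an audio-visual synchronization network.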
