Synthesizing Obama

Given audio of President Barack Obama, we synthesize a high-quality video of him speaking with accurate lip sync, composited into a target video clip. Trained on many hours of his weekly address footage, a recurrent neural network learns the mapping from raw audio features to mouth shapes. Given the mouth shape at each time instant, we synthesize high-quality mouth texture and composite it, with proper 3D pose matching, so that what he appears to be saying in the target video matches the input audio track. Our approach produces photorealistic results.
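
As a rough illustration of the audio-to-mouth-shape stage, the sketch below (not the authors' released code) wires up a small recurrent network in TensorFlow/Keras that regresses per-frame audio features, such as MFCCs, onto a low-dimensional mouth-shape representation, such as PCA coefficients of lip landmarks. The layer size and feature dimensions here are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch: a recurrent model mapping audio features to mouth shapes.
# AUDIO_DIM and MOUTH_DIM are assumed placeholder sizes (e.g. MFCC vector
# length and number of PCA mouth-shape coefficients), not the paper's values.
import tensorflow as tf

AUDIO_DIM = 28   # assumed per-frame audio feature size (e.g. MFCCs + deltas)
MOUTH_DIM = 18   # assumed mouth-shape coefficients per frame

model = tf.keras.Sequential([
    # Recurrent layer consumes a variable-length sequence of audio frames.
    tf.keras.layers.LSTM(60, return_sequences=True,
                         input_shape=(None, AUDIO_DIM)),
    # Per-frame regression onto the mouth-shape representation.
    tf.keras.layers.Dense(MOUTH_DIM),
])
model.compile(optimizer="adam", loss="mse")  # mean-squared-error regression
```

In practice such a model would be trained on aligned pairs of audio features and mouth shapes extracted from the address footage, and the predicted coefficients would then drive the texture synthesis and compositing stages described above.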
