VDub: Modifying Face Video of Actors for Plausible Visual Alignment to a Dubbed Audio Track

In many countries, foreign movies and TV productions are dubbed, i.e., the original voice of an actor is replaced with a translation spoken by a dubbing actor in the country's own language. Dubbing is a complex process that requires specific translations and accurately timed recitations so that the new audio at least coarsely adheres to the mouth motion in the video. However, since the sequences of phonemes and visemes in the original and the dubbing language differ, the video-to-audio match is never perfect, which is a major source of visual discomfort. In this paper, we propose a system that alters the mouth motion of an actor in a video so that it matches the new audio track. Our approach builds on high-quality monocular capture of the 3D facial performance, lighting, and albedo of both the dubbing and the target actor, and uses audio analysis in combination with a space-time retrieval method to synthesize a new, photo-realistically rendered and highly detailed 3D shape model of the mouth region that replaces the target performance. We demonstrate the plausible visual quality of our results, both qualitatively and through a user study, by comparing them to footage that has been professionally dubbed in the traditional way.
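The space-time retrieval at the heart of this pipeline can be illustrated with a small sketch. The code below is not the paper's implementation: it assumes that both the dubbed audio and the target actor's captured footage have already been phoneme-labelled (e.g. by forced alignment), that each database frame carries a low-dimensional mouth parameter vector, and it approximates the retrieval as a Viterbi search that trades viseme agreement against smooth transitions in parameter space. PHONEME_TO_VISEME, retrieve_mouth_sequence, and all costs and weights are hypothetical names and choices, not the authors'.

```python
import numpy as np

# Hypothetical many-to-one phoneme-to-viseme grouping; the real mapping is
# language- and dataset-specific, and this tiny table is illustrative only.
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "aa": "open", "ae": "open",
    "uw": "rounded", "ow": "rounded",
    "sil": "closed",
}

def retrieve_mouth_sequence(target_phonemes, db_phonemes, db_params, w_smooth=1.0):
    """Pick one database frame per target frame via a Viterbi search.

    target_phonemes -- per-frame phoneme labels of the dubbed audio track
    db_phonemes     -- per-frame phoneme labels of the captured database
    db_params       -- (n_db, d) array of mouth-region model parameters
                       (e.g. blendshape weights), one row per database frame
    Returns a list of database frame indices, one per target frame.
    """
    n_t, n_db = len(target_phonemes), len(db_phonemes)

    # Unary matching cost: 0 if the target frame and the database frame
    # fall into the same viseme class, 1 otherwise.
    def match_cost(t, j):
        vt = PHONEME_TO_VISEME.get(target_phonemes[t])
        vj = PHONEME_TO_VISEME.get(db_phonemes[j])
        return 0.0 if vt is not None and vt == vj else 1.0

    cost = np.array([[match_cost(t, j) for j in range(n_db)]
                     for t in range(n_t)])

    # Pairwise transition cost: parameter-space distance between database
    # frames, which penalizes jerky jumps and favors smooth mouth motion.
    trans = np.linalg.norm(db_params[:, None, :] - db_params[None, :, :], axis=2)

    # Forward pass: total[j] is the cheapest path ending in db frame j.
    total = cost[0].copy()
    back = np.zeros((n_t, n_db), dtype=int)
    for t in range(1, n_t):
        scores = total[:, None] + w_smooth * trans   # (prev frame, next frame)
        back[t] = np.argmin(scores, axis=0)
        total = scores[back[t], np.arange(n_db)] + cost[t]

    # Backtrack the minimum-cost path.
    path = [int(np.argmin(total))]
    for t in range(n_t - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

Under this simplification, a call such as retrieve_mouth_sequence(dub_phonemes, actor_phonemes, actor_blendshapes) yields frame indices whose mouth parameters can then be rendered back into the target video; the actual system additionally solves for detailed geometry, lighting, and albedo, which this sketch omits.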
