Video Rewrite: Driving Visual Speech with Audio

Video Rewrite uses existing footage to automatically create new video of a person mouthing words that she did not speak in the original footage. This technique is useful in movie dubbing, for example, where the movie sequence can be modified to sync the actors’ lip motions to the new soundtrack. Video Rewrite automatically labels the phonemes in the training data and in the new audio track, then reorders the mouth images in the training footage to match the phoneme sequence of the new audio track. When particular phonemes are unavailable in the training footage, Video Rewrite selects the closest approximations. The resulting sequence of mouth images is stitched into the background footage. This stitching process automatically corrects for differences in head position and orientation between the mouth images and the background footage. Video Rewrite uses computer-vision techniques to track points on the speaker’s mouth in the training footage, and morphing techniques to combine these mouth gestures into the final video sequence. The new video combines the dynamics of the original actor’s articulations with the mannerisms and setting dictated by the background footage. Video Rewrite is the first facial-animation system to automate all the labeling and assembly tasks required to resync existing footage to a new soundtrack.
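
A rough sketch of the selection step may help make the pipeline concrete. The Python snippet below maps each phoneme of the new audio track to a mouth image from the labeled training footage, falling back to a frame of the same viseme class (a group of phonemes sharing one mouth shape) when the exact phoneme is missing. The viseme grouping, function names, and first-match fallback are illustrative assumptions; the published system's matching and scoring are more elaborate than this sketch.

    from collections import defaultdict

    # Hypothetical viseme classes: phonemes grouped by similar mouth shape.
    # (Illustrative subset, not the paper's actual grouping.)
    VISEME_CLASS = {
        "p": "bilabial", "b": "bilabial", "m": "bilabial",
        "f": "labiodental", "v": "labiodental",
        "iy": "spread", "ih": "spread",
        "uw": "rounded", "ow": "rounded",
    }

    def build_index(training_labels):
        """Index the training frames by phoneme and by viseme class.

        training_labels: list of (frame_id, phoneme) pairs produced by the
        automatic phoneme labeling of the training footage.
        """
        by_phoneme = defaultdict(list)
        by_viseme = defaultdict(list)
        for frame_id, ph in training_labels:
            by_phoneme[ph].append(frame_id)
            by_viseme[VISEME_CLASS.get(ph, ph)].append(frame_id)
        return by_phoneme, by_viseme

    def select_frames(new_phonemes, by_phoneme, by_viseme):
        """Choose one training frame per phoneme of the new audio track.

        Exact phoneme matches are preferred; otherwise any frame from the
        same viseme class serves as the "closest approximation".
        """
        sequence = []
        for ph in new_phonemes:
            exact = by_phoneme.get(ph)
            nearby = by_viseme.get(VISEME_CLASS.get(ph, ph))
            if exact:
                sequence.append(exact[0])
            elif nearby:
                sequence.append(nearby[0])
            else:
                sequence.append(None)  # no usable mouth image available
        return sequence

    # Example: frames = select_frames(["m", "uw", "v"], *build_index(labels))

The viseme fallback reflects the observation that phonemes sharing the same mouth shape are nearly indistinguishable on screen, which is what makes substituting a closest approximation acceptable in the assembled video.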
