Trainable videorealistic speech animation

We describe how to use machine learning techniques to create a generative, videorealistic speech animation module. A human subject is first recorded with a video camera while uttering a predetermined speech corpus. After the corpus is processed automatically, a visual speech module is learned from the data; it can synthesize the subject's mouth producing entirely novel utterances that were not recorded in the original video. The synthesized utterance is re-composited onto a background sequence that contains natural head and eye movement. The final output is videorealistic in the sense that it looks like a video camera recording of the subject. At run time, the input to the system can be either real audio sequences or synthetic audio produced by a text-to-speech system, as long as they have been phonetically aligned.
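
To illustrate what a phonetically aligned input could look like when it drives frame-by-frame synthesis, here is a minimal sketch. The alignment format (label, start time, end time), the function name phonemes_per_frame, and the default frame rate of 30 fps are assumptions made for illustration only; none of them is taken from the system described above.

```python
from typing import List, Tuple

# Hypothetical alignment format: (phoneme label, start time in s, end time in s).
# Segments are assumed to be contiguous and sorted by time.
Alignment = List[Tuple[str, float, float]]


def phonemes_per_frame(alignment: Alignment, fps: float = 30.0) -> List[str]:
    """Label each video frame with the phoneme active at the frame's midpoint."""
    if not alignment:
        return []
    total_duration = alignment[-1][2]
    n_frames = int(round(total_duration * fps))
    labels: List[str] = []
    idx = 0
    for frame in range(n_frames):
        t = (frame + 0.5) / fps  # midpoint of the frame, in seconds
        # Advance to the segment whose end time lies beyond t.
        while idx < len(alignment) - 1 and t >= alignment[idx][2]:
            idx += 1
        labels.append(alignment[idx][0])
    return labels


if __name__ == "__main__":
    # Made-up alignment for a short utterance, sampled at 10 fps.
    example = [("h", 0.0, 0.1), ("ah", 0.1, 0.3)]
    print(phonemes_per_frame(example, fps=10.0))
```

A per-frame label sequence of this kind is one plausible way to connect an aligned phoneme stream, whether it comes from real audio or from a text-to-speech system, to a module that emits one mouth image per video frame.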
