Voice puppetry

We introduce a method for predicting a control signal from another related signal, and apply it to voice puppetry: generating full facial animation from the expressive information in an audio track. The voice puppet learns a facial control model from computer vision of real facial behavior, automatically incorporating vocal and facial dynamics such as co-articulation. Animation is produced by using audio to drive the model, which induces a probability distribution over the manifold of possible facial motions. We present a linear-time closed-form solution for the most probable trajectory over this manifold. The output is a series of facial control parameters, suitable for driving many different kinds of animation ranging from video-realistic image warps to 3D cartoon characters.

CR Categories: I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism—Animation; I.2.9 [Artificial Intelligence]: Robotics—Kinematics and Dynamics; I.4.8 [Image Processing and Computer Vision]: Scene Analysis—Time-varying images; G.3 [Mathematics of Computing]: Probability and Statistics—Time series analysis; E.4 [Data]: Coding and Information Theory—Data compaction and compression; J.5 [Computer Applications]: Arts and Humanities—Performing Arts
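To make the "most probable trajectory" idea concrete, the following is a minimal, illustrative sketch of the standard dynamic-programming (Viterbi) decode for a hidden Markov model: given a sequence of audio-derived observations, it finds the single most likely sequence of hidden states in time linear in the sequence length. The state names (`closed`, `open`) and observation labels (`quiet`, `loud`) are invented toy stand-ins for facial configurations and audio features; the paper's actual closed-form solution operates on continuous facial control trajectories and is not this discrete decoder.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable hidden-state path for an observation sequence.

    Runs in O(T * N^2) for T observations and N states, i.e. linear
    in the length of the sequence. Log-probabilities avoid underflow.
    """
    # Initialize with the first observation.
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    # Recurse forward, remembering the best predecessor of each state.
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            best_prev = max(states, key=lambda p: V[t - 1][p] + math.log(trans_p[p][s]))
            V[t][s] = (V[t - 1][best_prev]
                       + math.log(trans_p[best_prev][s])
                       + math.log(emit_p[s][obs[t]]))
            back[t][s] = best_prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Toy model: two hypothetical mouth states driven by a loudness feature.
states = ["closed", "open"]
start = {"closed": 0.8, "open": 0.2}
trans = {"closed": {"closed": 0.7, "open": 0.3},
         "open":   {"closed": 0.4, "open": 0.6}}
emit = {"closed": {"quiet": 0.9, "loud": 0.1},
        "open":   {"quiet": 0.2, "loud": 0.8}}

path = viterbi(["quiet", "loud", "loud"], states, start, trans, emit)
# -> ["closed", "open", "open"]
```

The decoded path is a discrete analogue of the output described above: a time series of control states that can drive an animation channel frame by frame.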
