Expressive Speech Animation Synthesis with Phoneme‐Level Controls

This paper presents a novel data‐driven expressive speech animation synthesis system with phoneme‐level controls. The system is built on a pre‐recorded facial motion capture database, for which an actress was directed to recite a pre‐designed corpus with four facial expressions (neutral, happiness, anger, and sadness). Given new phoneme‐aligned expressive speech and its emotion modifiers as inputs, a constrained dynamic programming algorithm searches the processed facial motion database for the best‐matched captured motion clips by minimizing a cost function. Users can optionally specify 'hard constraints' (motion‐node constraints for expressing phoneme utterances) and 'soft constraints' (emotion modifiers) to guide the search. We also introduce a phoneme–Isomap interface for visualizing and interacting with phoneme clusters, each typically composed of thousands of facial motion capture frames. Using this visualization interface, users can conveniently remove contaminated motion subsequences from a large facial motion dataset. Facial animation synthesis experiments and objective comparisons between synthesized and captured facial motion show that the system is effective for producing realistic expressive speech animations.
