Synthesizing speech animation by learning compact speech co-articulation models

While speech animation fundamentally consists of a sequence of phonemes over time, sophisticated animation requires smooth interpolation and co-articulation effects, where the preceding and following phonemes influence the shape of a phoneme. Co-articulation has been approached in speech animation research in several ways, most often by simply smoothing the mouth geometry motion over time. Data-driven approaches tend to generate realistic speech animation, but they need to store a large facial motion database, which is not feasible for real-time gaming and interactive applications on platforms such as PDAs and cell phones. In this paper we show that an accurate, compact speech co-articulation model can be learned from facial motion capture data. An initial phoneme sequence is generated automatically by a text-to-speech (TTS) system. Our learned co-articulation model is then applied to this phoneme sequence, producing natural and detailed motion. The contribution of this work is that speech co-articulation models learned from real human motion data can generate natural-looking speech motion while preserving the expressiveness of the animation via keyframing control. Because the models are compact, the approach is also practical for interactive applications.
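To make the baseline concrete, the following is a minimal sketch of the simple smoothing approach mentioned above, not the learned co-articulation model of this paper. It assumes a single hypothetical mouth-opening parameter per phoneme; the phoneme labels, target values, timings, and function names are all illustrative assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical mouth-opening targets per phoneme (0 = closed, 1 = wide open).
# These values are illustrative assumptions, not measured data.
MOUTH_OPEN: Dict[str, float] = {"sil": 0.0, "m": 0.0, "aa": 1.0, "p": 0.0}


def sample_targets(timeline: List[Tuple[str, int, int]], fps: int = 30) -> List[float]:
    """Build a piecewise-constant curve: every frame takes the target of the
    phoneme active at that time. Times are given in integer milliseconds."""
    frames: List[float] = []
    for phoneme, start_ms, end_ms in timeline:
        n_frames = (end_ms - start_ms) * fps // 1000
        frames.extend([MOUTH_OPEN[phoneme]] * n_frames)
    return frames


def smooth(curve: List[float], radius: int = 2) -> List[float]:
    """Moving average over up to 2 * radius + 1 frames; the window is
    truncated at the start and end of the curve."""
    out: List[float] = []
    for i in range(len(curve)):
        lo = max(0, i - radius)
        hi = min(len(curve), i + radius + 1)
        window = curve[lo:hi]
        out.append(sum(window) / len(window))
    return out


# A phoneme timeline such as a TTS front end might produce (assumed format).
timeline = [("sil", 0, 100), ("m", 100, 200), ("aa", 200, 400), ("p", 400, 500)]
raw = sample_targets(timeline)
smoothed = smooth(raw)
```

In this sketch the smoothed curve starts opening the mouth during the last frames of "m", which mimics anticipation of the following vowel but also weakens the lip closure. The filter treats every phoneme the same way, whereas a model learned from motion capture data can capture phoneme-specific influence.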
