Accurate automatic visible speech synthesis of arbitrary 3D models based on concatenation of diviseme motion capture data

We present a technique for accurate automatic visible speech synthesis from textual input. When provided with a speech waveform and the text of a spoken sentence, the system produces accurate visible speech synchronized with the audio signal. To develop the system, we collected motion capture data from a speaker's face during production of a set of words containing all diviseme sequences in English. The motion capture points from the speaker's face are retargeted to the vertices of the polygons of a 3D face model. When synthesizing a new utterance, the system locates the required sequence of divisemes, shrinks or expands each diviseme to match the desired phoneme segment durations in the target utterance, and then moves the polygons in the regions of the lips and lower face to correspond to the spatial coordinates of the motion capture data. The motion mapping is realized by a key-shape mapping function learned from a set of viseme examples in the source and target faces. A well-posed numerical algorithm estimates the shape-blending coefficients. Time warping and motion-vector blending at the juncture of two divisemes, together with an algorithm that searches for the optimal concatenated visible speech, are also developed to produce the final concatenative motion sequence. Copyright © 2004 John Wiley & Sons, Ltd.
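The key-shape mapping described above can be illustrated as a small constrained least-squares problem: express a captured frame as a convex combination of key viseme shapes, then apply the same weights to the target face's key shapes. The sketch below is an illustrative approximation only; the function names are hypothetical, and the clip-and-renormalize step is a crude stand-in for the well-posed constrained solver the paper actually uses.

```python
import numpy as np

def estimate_blend_coefficients(key_shapes, target, ridge=1e-6):
    """Estimate blending weights w so that key_shapes @ w approximates target.

    key_shapes: (d, k) matrix, one flattened key viseme shape per column.
    target:     (d,)   flattened target shape (e.g. a motion-capture frame).

    Solves a small ridge-regularized normal-equations system, then clips
    negative weights and renormalizes so they sum to one (an illustrative
    simplification, not the paper's solver).
    """
    k = key_shapes.shape[1]
    A = key_shapes.T @ key_shapes + ridge * np.eye(k)
    b = key_shapes.T @ target
    w = np.linalg.solve(A, b)
    w = np.clip(w, 0.0, None)          # enforce non-negativity
    s = w.sum()
    return w / s if s > 0 else np.full(k, 1.0 / k)

def map_to_target_face(target_key_shapes, w):
    """Retarget: apply the source-face weights to the target face's key shapes."""
    return target_key_shapes @ w
```

Because the same weight vector drives both faces, the target face reproduces the source mouth shape in its own geometry.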
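The duration adjustment and juncture blending steps can likewise be sketched in a few lines: each diviseme clip is resampled to the target phoneme-segment duration, and overlapping frames at the juncture of two divisemes are cross-faded. This is a minimal linear-warp sketch under assumed names; the paper's actual warping and motion-vector blending functions may differ.

```python
import numpy as np

def warp_diviseme(frames, target_len):
    """Linearly resample a diviseme motion clip to a new frame count.

    frames:     (n, d) array of per-frame marker coordinates.
    target_len: desired number of frames for this segment of the utterance.
    """
    n, d = frames.shape
    src = np.linspace(0.0, 1.0, n)
    dst = np.linspace(0.0, 1.0, target_len)
    out = np.empty((target_len, d))
    for j in range(d):                      # warp each coordinate channel
        out[:, j] = np.interp(dst, src, frames[:, j])
    return out

def blend_juncture(tail, head, overlap):
    """Cross-fade the last `overlap` frames of one diviseme with the
    first `overlap` frames of the next (a simple motion-vector blend)."""
    a = np.linspace(1.0, 0.0, overlap)[:, None]   # fade-out weights
    return a * tail[-overlap:] + (1.0 - a) * head[:overlap]
```

Concatenating warped divisemes with a short blended overlap at each juncture yields a continuous motion sequence aligned to the audio's phoneme durations.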
