Visual Coding and Tracking of Speech Related Facial Motion

We present a new method for video-based coding of the facial motions inherent in speaking. We propose a set of four Facial Speech Parameters (FSP): jaw opening, lip rounding, lip closure, and lip raising, which represent the primary visual gestures of speech articulation. To generate a parametric model of facial actions, we first build a statistical model from accurate 3D data of a reference human subject. The FSP are then associated with the linear modes of this statistical model, yielding a 3D parametric facial mesh that is linearly deformed by the FSP. To track a talking face, the parametric model is adapted and aligned to the subject's face. Face motion is then tracked by optimally aligning each incoming video frame with the face model, textured with the first image and deformed by varying the FSP, head rotations, and translations. Finer details of lip and skin deformation are modeled by blending textures into an appearance model. We show tracking results for several subjects, and finally demonstrate encoding of facial activity into the four FSP values to represent speaker-independent phonetic information and to generate different styles of animation.
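In essence, the parametric model described above amounts to a linear shape model: the deformed mesh is the neutral mesh plus an FSP-weighted sum of deformation modes, followed by a rigid head pose. A minimal sketch of that computation (all names, array shapes, and values here are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def deform_face(base_mesh, modes, fsp, rotation, translation):
    """Linearly deform a face mesh by four FSP values, then apply a
    rigid head pose (3x3 rotation matrix, 3-vector translation).

    base_mesh : (N, 3) neutral face vertices
    modes     : (4, N, 3) linear modes for jaw opening, lip rounding,
                lip closure, and lip raising (hypothetical layout)
    fsp       : (4,) Facial Speech Parameter weights
    """
    # Weighted sum of the four modes added to the neutral shape.
    deformed = base_mesh + np.tensordot(fsp, modes, axes=1)  # (N, 3)
    # Rigid head motion applied to every vertex.
    return deformed @ rotation.T + translation

# Toy example: 2 vertices, identity head pose.
base = np.zeros((2, 3))
modes = np.ones((4, 2, 3))
fsp = np.array([0.5, 0.0, 0.0, 0.25])  # jaw half open, lips slightly raised
out = deform_face(base, modes, fsp, np.eye(3), np.zeros(3))
```

Tracking then reduces to searching over the four FSP values plus the six rigid-pose parameters so that the rendered, textured model best matches each incoming frame.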
