Visual speech synthesis from 3D video

In this paper we describe a parameterisation of lip movements that maintains the dynamic structure inherent in the task of producing speech sounds. A stereo capture system is used to reconstruct 3D models of a speaker producing sentences from the TIMIT corpus. These data are mapped into a space that preserves the relationships between samples and their temporal derivatives. By incorporating dynamic information within the parameterisation of lip movements we can model both the cyclical structure and the causal nature of speech movements, as described by an underlying visual speech manifold. It is believed that such a structure will be applicable to various areas of speech modelling, in particular the synthesis of speech lip movements.
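As a rough illustration of the idea, and not necessarily the authors' exact formulation, one plausible way to build a dynamics-aware parameterisation is to augment each captured 3D lip frame with its finite-difference temporal derivative before a linear (PCA-style) embedding, so that the reduced space jointly encodes positions and rates of change. The sketch below assumes the tracked 3D vertices are available as a `(T, V, 3)` array; the function name and dimensionality are illustrative only.

```python
import numpy as np

def dynamic_parameterisation(frames, n_components=10):
    """Embed 3D lip-vertex trajectories together with their temporal
    derivatives, so the low-dimensional space preserves both the samples
    and their rates of change (a sketch, not the paper's exact method).

    frames: array of shape (T, V, 3) -- T frames of V tracked 3D vertices.
    Returns the (T, n_components) reduced trajectory, the basis, and the mean.
    """
    T = frames.shape[0]
    X = frames.reshape(T, -1)            # flatten each frame into one vector
    V = np.gradient(X, axis=0)           # finite-difference temporal derivative
    Z = np.hstack([X, V])                # joint position/velocity sample

    mean = Z.mean(axis=0)
    Zc = Z - mean
    # PCA via SVD of the centred position+velocity matrix
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)
    basis = Vt[:n_components]            # principal directions of the joint space
    coords = Zc @ basis.T                # trajectory in the reduced space
    return coords, basis, mean
```

Because each embedded point carries its own derivative, neighbouring points in the reduced space are consistent in both position and direction of motion, which is the property exploited when treating the data as samples from a visual speech manifold.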
