Model-based synthesis of visual speech movements from 3D video

We describe a method for the synthesis of visual speech movements using a hybrid unit selection/model-based approach. Speech lip movements are captured using a 3D stereo face capture system and segmented into phonetic units. A dynamic parameterisation of the data is constructed that maintains the relationship between lip shapes and their velocities; within this parameterisation, a model of lip motion is built and used to animate visual speech from audio input. The mapping from audio parameters to lip movements is disambiguated by selecting only the stored phonetic units most similar to the target utterance during synthesis. By combining the properties of model-based synthesis (e.g., HMMs, neural networks) with unit selection, we improve the quality of the synthesised speech.
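The selection step is the core of the hybrid approach: candidate phonetic units are ranked by acoustic similarity to the target utterance, and only the closest are passed to the model-based stage. The sketch below is illustrative only, not the paper's implementation; it assumes units are stored with MFCC feature sequences and lip-trajectory parameters, and all names (PhoneticUnit, select_units, blend_trajectories) are hypothetical.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class PhoneticUnit:
    """A stored phonetic unit: acoustic features plus the lip trajectory
    (e.g., coefficients of a 3D lip-shape parameterisation) for that phone.
    This layout is an assumption for illustration, not the paper's format."""
    phone: str           # phone label, e.g. "aa"
    mfcc: np.ndarray     # (T, D) acoustic feature sequence for the unit
    lip_traj: np.ndarray # (T, K) lip-shape parameters over time

def unit_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Euclidean distance between time-averaged MFCCs: a crude stand-in
    for a proper sequence distance such as DTW over the full features."""
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))

def select_units(target_mfcc: np.ndarray, candidates: list, k: int = 3) -> list:
    """Keep only the k stored units most similar to the target audio,
    disambiguating the audio-to-lip mapping before the model-based stage."""
    ranked = sorted(candidates, key=lambda u: unit_distance(u.mfcc, target_mfcc))
    return ranked[:k]

def blend_trajectories(units: list) -> np.ndarray:
    """Average the selected lip trajectories after resampling to a common
    length; a placeholder for the paper's model-based trajectory synthesis."""
    T = min(u.lip_traj.shape[0] for u in units)
    resampled = [u.lip_traj[np.linspace(0, u.lip_traj.shape[0] - 1, T).astype(int)]
                 for u in units]
    return np.mean(resampled, axis=0)
```

In the paper the blending stage is model-based, built within the dynamic shape/velocity parameterisation; the simple averaging above only marks where that model would sit in the pipeline.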
