Model-Based Synthesis of Visual Speech Movements from 3D Video

We describe a method for synthesising visual speech movements using a hybrid unit-selection/model-based approach. Lip movements during speech are captured with a 3D stereo face-capture system and segmented into phonetic units. From this data we construct a dynamic parameterisation that maintains the relationship between lip shapes and their velocities; within this parameterisation, a model of lip motion is built and used to animate visual speech from audio input. The mapping from audio parameters to lip movements is disambiguated by selecting, during synthesis, only the stored phonetic units most similar to the target utterance. By combining the properties of model-based synthesis (e.g., HMMs, neural networks) with unit selection, we improve the quality of the synthesised visual speech.
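To make the two key steps concrete, below is a minimal sketch in Python of (a) a joint shape/velocity parameterisation and (b) similarity-based unit selection. The concrete choices here are assumptions, not the paper's method: PCA over stacked shape/velocity vectors, MFCC audio features, a resampling-based frame distance, and the `unit["phone"]`/`unit["mfcc"]` database layout are all hypothetical illustrations.

```python
import numpy as np

# Illustrative sketch only: the abstract does not specify the actual
# parameterisation or similarity measure; PCA, MFCCs, and the database
# layout below are assumptions chosen for clarity.

def dynamic_parameterisation(lip_shapes, n_components=10):
    """Build a joint shape/velocity parameterisation.

    lip_shapes: (T, D) array of lip vertex coordinates over time.
    Velocities are finite differences; stacking them with the shapes
    lets a single linear (PCA) model couple positions and velocities.
    """
    velocities = np.gradient(lip_shapes, axis=0)        # (T, D)
    features = np.hstack([lip_shapes, velocities])      # (T, 2D)
    mean = features.mean(axis=0)
    centred = features - mean
    # PCA via SVD; the rows of vt are the principal axes.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    basis = vt[:n_components]
    params = centred @ basis.T                          # (T, n_components)
    return params, basis, mean

def select_units(target_mfcc, phone, database, k=3):
    """Keep only the k stored units of this phone whose audio features
    best match the target, disambiguating the audio-to-lips mapping.

    target_mfcc: (T, n_mfcc) features for the target phonetic unit.
    database: list of dicts with hypothetical "phone"/"mfcc" keys.
    """
    def distance(unit):
        cand = unit["mfcc"]
        # Linearly resample the candidate to the target's length,
        # then take the mean frame-wise Euclidean distance.
        idx = np.linspace(0, len(cand) - 1, len(target_mfcc)).round().astype(int)
        return np.linalg.norm(target_mfcc - cand[idx], axis=1).mean()

    same_phone = [u for u in database if u["phone"] == phone]
    return sorted(same_phone, key=distance)[:k]
```

Stacking velocities alongside shapes means the linear model cannot represent a lip pose independently of how fast it is moving, which is one plausible way to read "maintains the relationship between lip shapes and velocities"; a dynamic-time-warping distance would be a natural alternative to the crude resampling used in `select_units`.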
