Co-articulation generation using maximum direction change and apparent motion for Chinese visual speech synthesis

This study presents an approach to automated lip synchronization and smoothing for Chinese visual speech synthesis. A facial animation system with a synchronization algorithm is also developed to visualize an existing Text-To-Speech system. Motion parameters for each viseme are first constructed from video footage of a human speaker. To synchronize the parameter-set sequence with the speech signal, a maximum direction change algorithm is proposed to select significant parameter sets according to the speech duration. Moreover, to improve the smoothness of the co-articulation segments at a high speaking rate, four phoneme-dependent co-articulation functions are generated by integrating the Bernstein-Bézier curve with the apparent-motion property. A Chinese visual speech synthesis system is built to evaluate the proposed approach. The output of the proposed system is compared with footage of the real speaker, and the co-articulation generated by the proposed approach is also evaluated.
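
The maximum direction change algorithm is specified in the paper body rather than in this abstract. As a rough illustration of the idea only, the sketch below selects, from a viseme's motion-parameter trajectory, the frames whose motion direction turns most sharply, so that the retained frames fit a shorter speech duration. The function name, the turning-angle score, and the endpoint handling are assumptions made for illustration, not the authors' exact formulation.

```python
import numpy as np

def max_direction_change_keyframes(params: np.ndarray, n_keep: int) -> np.ndarray:
    """Pick n_keep frames (n_keep >= 2) out of a (T, D) viseme parameter
    trajectory, always keeping both endpoints and otherwise preferring the
    frames where the motion direction turns most sharply.
    NOTE: hypothetical sketch; not the paper's exact algorithm."""
    T = params.shape[0]
    if n_keep >= T:
        return np.arange(T)
    # Unit direction of motion between consecutive frames.
    diffs = np.diff(params, axis=0)                                    # (T-1, D)
    dirs = diffs / np.maximum(np.linalg.norm(diffs, axis=1, keepdims=True), 1e-9)
    # Turning angle at each interior frame: the angle between the incoming
    # and outgoing motion directions (large angle = sharp direction change).
    cos_angle = np.clip(np.sum(dirs[:-1] * dirs[1:], axis=1), -1.0, 1.0)
    turn = np.arccos(cos_angle)                                        # (T-2,)
    # Indices 1..T-2 of the (n_keep - 2) sharpest turns, plus both endpoints.
    interior = np.argsort(turn)[::-1][: n_keep - 2] + 1
    return np.sort(np.concatenate(([0, T - 1], interior)))

# Example: compress a 30-frame viseme clip of 6 motion parameters down to
# 8 significant parameter sets to match a shorter phoneme duration.
traj = np.random.rand(30, 6)
keys = max_direction_change_keyframes(traj, n_keep=8)
compressed = traj[keys]
```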
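Likewise, the four phoneme-dependent co-articulation functions are only named in the abstract. The sketch below shows one plausible way a cubic Bernstein-Bézier blend between two viseme parameter sets could be parameterized; the `hold` and `anticipate` control-point fractions are hypothetical stand-ins for the phoneme-dependent shaping the paper describes, not the authors' formulation.

```python
import numpy as np

def bernstein_bezier_blend(p_prev: np.ndarray, p_next: np.ndarray,
                           n_frames: int,
                           hold: float = 0.0, anticipate: float = 1.0) -> np.ndarray:
    """Blend two viseme parameter sets over n_frames using a cubic Bezier
    curve written in its Bernstein form.  The inner control points sit at
    fractions `hold` and `anticipate` of the way from p_prev to p_next:
    (0.0, 1.0) gives a symmetric ease-in/ease-out transition, while other
    values skew the timing toward whichever phoneme should dominate.
    NOTE: hypothetical parameterization, for illustration only."""
    t = np.linspace(0.0, 1.0, n_frames)[:, None]          # (n_frames, 1)
    # Cubic Bernstein basis polynomials B_{0..3,3}(t).
    basis = np.hstack([(1 - t) ** 3,
                       3 * t * (1 - t) ** 2,
                       3 * t ** 2 * (1 - t),
                       t ** 3])                           # (n_frames, 4)
    # Control points: endpoints are the viseme targets; the two inner
    # points are placed along the segment between them.
    ctrl = np.stack([p_prev,
                     (1 - hold) * p_prev + hold * p_next,
                     (1 - anticipate) * p_prev + anticipate * p_next,
                     p_next])                             # (4, D)
    return basis @ ctrl                                   # (n_frames, D)

# Example: a 10-frame transition between two 6-dimensional viseme targets.
frames = bernstein_bezier_blend(np.zeros(6), np.ones(6), n_frames=10)
```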
