Real-time speech-driven lip synchronization

Speech-driven lip synchronization, an important part of facial animation, animates a face model so that its lip movements are synchronized with the acoustic speech signal. It has many applications in human-computer interaction. In this paper, we present a framework that systematically addresses multimodal database collection and processing, as well as real-time speech-driven lip synchronization using collaborative filtering, a data-driven approach widely used by online retailers to recommend products. Mel-frequency cepstral coefficients (MFCCs), together with their delta and acceleration coefficients, serve as the acoustic features, and MPEG-4 Facial Animation Parameters (FAPs) serve as the animation parameters for the visual representation of speech. The proposed system is speaker-independent and real-time capable. Subjective experiments show that the proposed approach generates natural facial animation.
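
To make the audio-to-visual mapping step concrete, the following is a minimal sketch of a collaborative-filtering-style lookup under stated assumptions: each incoming MFCC frame (static, delta, and acceleration coefficients) is matched against a paired audio-visual database, and the FAP vectors of the acoustically nearest frames are blended with inverse-distance weights. The function name, array shapes, and weighting scheme are illustrative assumptions, not the exact formulation of the proposed system.

```python
# Illustrative sketch only: a nearest-neighbour, collaborative-filtering-style
# mapping from acoustic features to MPEG-4 FAPs. Names, dimensionalities, and
# the inverse-distance weighting are assumptions, not the paper's method.
import numpy as np

def predict_faps(query_mfcc, db_mfcc, db_faps, k=5, eps=1e-8):
    """Estimate a FAP vector for one audio frame.

    query_mfcc : (D,)   MFCC + delta + acceleration features of the frame
    db_mfcc    : (N, D) acoustic features of the recorded database frames
    db_faps    : (N, F) MPEG-4 FAP vectors paired with db_mfcc
    """
    # Euclidean distance between the query frame and every database frame
    dists = np.linalg.norm(db_mfcc - query_mfcc, axis=1)

    # Indices of the k acoustically closest database frames
    nearest = np.argsort(dists)[:k]

    # Inverse-distance weights, normalised to sum to one
    weights = 1.0 / (dists[nearest] + eps)
    weights /= weights.sum()

    # Weighted blend of the neighbours' FAP vectors
    return weights @ db_faps[nearest]

if __name__ == "__main__":
    # Random data standing in for a real multimodal corpus
    rng = np.random.default_rng(0)
    db_mfcc = rng.normal(size=(1000, 39))  # 13 MFCCs + deltas + accelerations
    db_faps = rng.normal(size=(1000, 66))  # 66 low-level MPEG-4 FAPs
    frame = rng.normal(size=39)
    print(predict_faps(frame, db_mfcc, db_faps).shape)  # (66,)
```

Because the lookup touches every stored frame, a real-time system would typically restrict the database size or index it (e.g. with a k-d tree) to keep per-frame latency bounded.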
