Real-time speech-driven lip synchronization

Speech-driven lip synchronization, an important part of facial animation, animates a face model so that its lip movements are synchronized with the acoustic speech signal. It has many applications in human-computer interaction. In this paper, we present a framework that systematically addresses multimodal database collection and processing, as well as real-time speech-driven lip synchronization using collaborative filtering, a data-driven approach widely used by online retailers to recommend products. Mel-frequency cepstral coefficients (MFCCs), together with their delta and acceleration coefficients, serve as the acoustic features, and MPEG-4 Facial Animation Parameters (FAPs) serve as the animation parameters for the visual representation of speech. The proposed system is speaker-independent and real-time capable. Subjective experiments show that the proposed approach generates natural facial animation.
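
To make the audio-to-visual mapping step concrete, the following is a minimal sketch of a collaborative-filtering-style lookup under stated assumptions: each incoming MFCC frame (static, delta, and acceleration coefficients) is matched against a paired audio-visual database, and the FAP vectors of the acoustically nearest frames are blended with inverse-distance weights. The function name, array shapes, and weighting scheme are illustrative assumptions, not the exact formulation of the proposed system.

```python
# Illustrative sketch only: a nearest-neighbour, collaborative-filtering-style
# mapping from acoustic features to MPEG-4 FAPs. Names, dimensionalities, and
# the inverse-distance weighting are assumptions, not the paper's method.
import numpy as np

def predict_faps(query_mfcc, db_mfcc, db_faps, k=5, eps=1e-8):
    """Estimate a FAP vector for one audio frame.

    query_mfcc : (D,)   MFCC + delta + acceleration features of the frame
    db_mfcc    : (N, D) acoustic features of the recorded database frames
    db_faps    : (N, F) MPEG-4 FAP vectors paired with db_mfcc
    """
    # Euclidean distance between the query frame and every database frame
    dists = np.linalg.norm(db_mfcc - query_mfcc, axis=1)

    # Indices of the k acoustically closest database frames
    nearest = np.argsort(dists)[:k]

    # Inverse-distance weights, normalised to sum to one
    weights = 1.0 / (dists[nearest] + eps)
    weights /= weights.sum()

    # Weighted blend of the neighbours' FAP vectors
    return weights @ db_faps[nearest]

if __name__ == "__main__":
    # Random data standing in for a real multimodal corpus
    rng = np.random.default_rng(0)
    db_mfcc = rng.normal(size=(1000, 39))  # 13 MFCCs + deltas + accelerations
    db_faps = rng.normal(size=(1000, 66))  # 66 low-level MPEG-4 FAPs
    frame = rng.normal(size=39)
    print(predict_faps(frame, db_mfcc, db_faps).shape)  # (66,)
```

Because the lookup touches every stored frame, a real-time system would typically restrict the database size or index it (e.g. with a k-d tree) to keep per-frame latency bounded.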
