Lip synchronization using linear predictive analysis

Linear predictive (LP) analysis is a widely used technique for speech analysis and coding. We discuss the issues involved in applying it to phoneme extraction and lip synchronization. LP analysis yields a set of reflection coefficients that are closely related to the vocal tract shape. Since the vocal tract shape correlates with the phoneme being spoken, LP analysis can be applied directly to phoneme extraction. We use a neural network to classify the reflection coefficients into a set of vowels. In addition, average energy is used to handle vowel-vowel and vowel-consonant transitions, while zero-crossing information is used to detect the presence of fricatives. We apply the extracted phoneme information directly to our synthetic 3D face model. The proposed method is fast, easy to implement, and adequate for real-time speech animation. Because it relies on neither language structure nor speech recognition, the method is language independent; it is also speaker independent. It can be applied to lip synchronization for entertainment applications and avatar animation in virtual environments.
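The per-frame feature extraction described above (reflection coefficients from LP analysis, short-time energy, and zero-crossing rate) can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the frame length, LP order of 10, and the Levinson-Durbin formulation used to obtain the reflection (PARCOR) coefficients are assumptions chosen to match common practice in LP speech analysis.

```python
import numpy as np

def reflection_coefficients(frame, order=10):
    """LP analysis of one speech frame via the Levinson-Durbin recursion.

    Returns the reflection (PARCOR) coefficients k_1..k_order, which
    relate to the area ratios of a lossless-tube vocal tract model.
    """
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Biased autocorrelation r[0..order]
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
    a = np.zeros(order + 1)   # prediction polynomial, a[0] = 1
    a[0] = 1.0
    k = np.zeros(order)
    err = r[0]                # prediction error energy
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        ki = -acc / err
        k[i - 1] = ki
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + ki * a_prev[i - j]
        a[i] = ki
        err *= (1.0 - ki * ki)
    return k

def short_time_energy(frame):
    """Average energy of the frame (gates vowel/consonant transitions)."""
    return float(np.mean(np.asarray(frame, dtype=float) ** 2))

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose sign changes (high for fricatives)."""
    signs = np.signbit(np.asarray(frame, dtype=float)).astype(int)
    return float(np.mean(np.abs(np.diff(signs))))
```

In a pipeline like the one described, the reflection-coefficient vector of each frame would feed a small neural-network vowel classifier, while energy and zero-crossing rate act as cheap side channels for transitions and fricatives, keeping the whole chain fast enough for real-time animation.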
