SVM-Based Phoneme Classification and Lip Shape Refinement in a Real-Time Lip-Synch System

In this paper, we present a real-time lip-synch system that animates a 2-D avatar's lip motion in synch with an incoming speech utterance. To achieve real-time operation, processing time is minimized by "merge and split" procedures that yield a coarse-to-fine phoneme classification. At each classification stage, the support vector machine (SVM) method is applied to reduce the computational load while maintaining the desired accuracy. The coarse-to-fine phoneme classification is accomplished via two stages of feature extraction: in the first stage, each speech frame is acoustically analyzed into one of three classes of lip opening using Mel-frequency cepstral coefficients (MFCC) as features; in the second stage, each frame is further refined into a detailed lip shape using formant information. The method was implemented as a 2-D lip animation, and the system was demonstrated to accomplish real-time lip-synch. The approach was tested on a PC (Microsoft Visual Studio, Intel Pentium IV 1.4 GHz CPU, 384 MB RAM). The combination of phoneme merging and SVM achieved roughly twice the recognition speed of a hidden Markov model (HMM) approach: typical latency per frame was on the order of 18.22 milliseconds with the proposed method, versus about 30.67 milliseconds for the HMM method under identical conditions.
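The two-stage pipeline described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the class labels, feature dimensions, and the use of scikit-learn's `SVC` are all assumptions, and random vectors stand in for per-frame MFCC and formant measurements.

```python
# Sketch (assumed, not the authors' implementation) of coarse-to-fine
# SVM phoneme classification for lip-synch: stage 1 maps MFCC features
# to one of three lip-opening classes; stage 2 refines each class into
# a detailed lip shape from formant features.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
N_MFCC, N_FORMANT = 13, 2          # typical MFCC order; F1/F2 formants (assumed)
COARSE = ["closed", "half-open", "open"]               # 3 lip-opening classes
FINE = {0: ["m", "p"], 1: ["e", "o"], 2: ["a", "ah"]}  # hypothetical shapes

# --- training (synthetic vectors stand in for labelled speech frames) ---
X_mfcc = rng.normal(size=(300, N_MFCC))
y_coarse = rng.integers(0, 3, size=300)
stage1 = SVC(kernel="rbf").fit(X_mfcc, y_coarse)       # coarse classifier

stage2 = {}
for c, shapes in FINE.items():
    X_formant = rng.normal(size=(100, N_FORMANT))
    y_shape = rng.integers(0, len(shapes), size=100)
    stage2[c] = SVC(kernel="rbf").fit(X_formant, y_shape)  # per-class refiner

# --- per-frame classification: coarse lip opening first, then lip shape ---
def classify_frame(mfcc_vec, formant_vec):
    c = int(stage1.predict(mfcc_vec[None, :])[0])
    s = int(stage2[c].predict(formant_vec[None, :])[0])
    return COARSE[c], FINE[c][s]

lip_open, lip_shape = classify_frame(rng.normal(size=N_MFCC),
                                     rng.normal(size=N_FORMANT))
```

Running the second-stage SVM only within the coarse class chosen by the first stage is what keeps the per-frame cost low: each refiner discriminates among a handful of shapes rather than the full phoneme set.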
