Paper Information - Speech Driven 3D Head Gesture Synthesis

Speech Driven 3D Head Gesture Synthesis

In this paper, we present a speech-driven system for the analysis and synthesis of natural head gestures. The proposed system assumes that sharp head movements are correlated with prominence in speech. For analysis, a binocular camera system is employed to capture the head motion of a talking person. The motion parameters associated with the 3D head motion are then used to extract the repetitive head gestures. In parallel, prosodic events are detected using an HMM structure with pitch, formant frequencies, and speech intensity as audio features. For synthesis, the head motion parameters are estimated from the prosodic events based on a gesture-speech correlation model, and the associated Euler angles are then used for speech-driven animation of a 3D personalized talking head model. Results on head motion feature extraction, prosodic event detection, and correlation modelling are provided.
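The abstract states that estimated Euler angles drive the animation of the 3D head model, but it does not give the angle convention. As a minimal sketch only, the following Python code assumes rotations about the x, y, and z axes composed as R = Rz * Ry * Rx, with angles in radians. The function names `euler_to_rotation_matrix` and `rotate_point` are hypothetical and are not taken from the paper.

```python
import math


def euler_to_rotation_matrix(angle_x, angle_y, angle_z):
    """Build a 3x3 rotation matrix R = Rz * Ry * Rx from Euler angles in radians.

    The composition order is an assumption made for this sketch; the paper's
    abstract does not specify which convention the authors use.
    """
    cx, sx = math.cos(angle_x), math.sin(angle_x)
    cy, sy = math.cos(angle_y), math.sin(angle_y)
    cz, sz = math.cos(angle_z), math.sin(angle_z)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]


def rotate_point(matrix, point):
    """Apply a 3x3 rotation matrix to a 3D point given as a list of 3 floats."""
    return [sum(matrix[i][j] * point[j] for j in range(3)) for i in range(3)]


# Example: a hypothetical head-model vertex on the x axis, turned by
# 90 degrees about the z axis.
vertex = [1.0, 0.0, 0.0]
rotation = euler_to_rotation_matrix(0.0, 0.0, math.pi / 2)
turned = rotate_point(rotation, vertex)
```

In an animation loop, one such matrix would be applied to every vertex of the head model for each frame's estimated angles. A different axis order or a degrees-based input would require adjusting the function accordingly.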

A.M. Tekalp | E. Erzin | Y. Yemez | A.T. Erdem | M.E. Sargin
