Analysis of Head Gesture and Prosody Patterns for Prosody-Driven Head-Gesture Animation

We propose a new two-stage framework for the joint analysis of a speaker's head gesture and speech prosody patterns, aimed at the automatic synthesis of realistic head gestures from speech prosody. In the first stage, we perform hidden Markov model (HMM) based unsupervised temporal segmentation of the head gesture features and the speech prosody features separately, to determine the elementary head gesture patterns and the elementary prosody patterns of a particular speaker. In the second stage, we jointly analyze the correlations between these elementary gesture and prosody patterns using Multi-Stream HMMs to determine an audio-visual mapping model. Given a head model for the speaker, the resulting mapping model is then used to synthesize natural head gestures from arbitrary input test speech. At synthesis time, the mapping model predicts a sequence of gesture patterns from the prosody pattern sequence computed for the input test speech, and the Euler angles associated with each gesture pattern are applied to animate the speaker's head model. Objective and subjective evaluations indicate that the proposed synthesis-by-analysis scheme produces natural-looking head gestures for the speaker with any input test speech, as well as in "prosody transplant" and "gesture transplant" scenarios.
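To make the synthesis step more concrete, the following is a minimal sketch of how a sequence of gesture patterns could be decoded from a sequence of prosody pattern labels. It is not the Multi-Stream HMM described above: it swaps in a plain discrete HMM with Viterbi decoding, where gesture patterns are the hidden states and prosody patterns are the observations. The function name `viterbi_gesture_decode` and all probabilities are hypothetical values chosen for illustration, not results from the paper.

```python
import math


def viterbi_gesture_decode(prosody_seq, log_init, log_trans, log_emit):
    """Return the most likely gesture-pattern sequence for a prosody-pattern sequence.

    Simplified stand-in for the audio-visual mapping model (an assumption):
      log_init[g]     log-probability of starting in gesture pattern g
      log_trans[h][g] log-probability of moving from gesture h to gesture g
      log_emit[g][p]  log-probability of prosody pattern p given gesture g
    """
    n_gestures = len(log_init)
    # score[g] = best log-probability of any path that ends in gesture g
    score = [log_init[g] + log_emit[g][prosody_seq[0]] for g in range(n_gestures)]
    backpointers = []
    for p in prosody_seq[1:]:
        new_score, pointers = [], []
        for g in range(n_gestures):
            best_prev = max(range(n_gestures),
                            key=lambda h: score[h] + log_trans[h][g])
            new_score.append(score[best_prev] + log_trans[best_prev][g] + log_emit[g][p])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final gesture pattern.
    g = max(range(n_gestures), key=lambda h: score[h])
    path = [g]
    for pointers in reversed(backpointers):
        g = pointers[g]
        path.append(g)
    path.reverse()
    return path


def log_matrix(rows):
    """Element-wise natural log of a nested list of probabilities."""
    return [[math.log(x) for x in row] for row in rows]


# Toy, made-up parameters: 2 gesture patterns, 3 prosody patterns.
log_init = [math.log(0.5), math.log(0.5)]
log_trans = log_matrix([[0.8, 0.2],
                        [0.2, 0.8]])
log_emit = log_matrix([[0.70, 0.25, 0.05],
                       [0.10, 0.20, 0.70]])

gesture_path = viterbi_gesture_decode([0, 0, 1, 2, 2], log_init, log_trans, log_emit)
print(gesture_path)  # [0, 0, 0, 1, 1]
```

In a full system along the lines of the abstract, each decoded gesture label would then index the Euler angle trajectory of that gesture pattern, which is what drives the head model.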
