Automatic head gesture learning and synthesis from prosodic cues

We present a novel approach to automatically learn and synthesize head gestures from prosodic features extracted from the acoustic speech signal. A minimum-entropy hidden Markov model is employed to learn the 3-D head motion of a speaker, yielding a generative model that is compact and highly predictive. The model is further exploited to synchronize the head motion with a set of continuous prosodic observations, capturing the correspondence between the two through a shared state machine. In synthesis, the prosodic features serve as the cue signal driving the generative model, from which 3-D head gestures are inferred. A tracking algorithm based on the Bézier volume deformation model is implemented to track the head motion. To evaluate the performance of the proposed system, we compare the true head motion with the prosody-inferred motion. The prosody-to-head-motion mapping acquired through learning is subsequently applied to animate a talking head. Very convincing head gestures are produced when novel prosodic cues from the same speaker are presented.
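The synthesis step described above (prosodic observations decoded through a shared state machine, with each state emitting a head pose) can be sketched as follows. This is a minimal illustration, not the paper's minimum-entropy HMM: it assumes a hand-specified three-state HMM with diagonal-Gaussian emissions over hypothetical (pitch, energy) features, and a made-up mean head rotation attached to each state. Viterbi decoding of the prosody stream then yields a state path, and reading off each state's pose gives the inferred head motion.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Most likely HMM state path given per-frame emission log-likelihoods.

    log_pi: (N,) initial log-probabilities; log_A: (N, N) transition
    log-probabilities (row = previous state); log_B: (T, N) frame scores.
    """
    T, N = log_B.shape
    delta = np.zeros((T, N))          # best path score ending in each state
    psi = np.zeros((T, N), dtype=int)  # backpointers
    delta[0] = log_pi + log_B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # [i, j] = prev i -> next j
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = scores[psi[t], np.arange(N)] + log_B[t]
    path = np.zeros(T, dtype=int)
    path[-1] = np.argmax(delta[-1])
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path

# Hypothetical 3-state model over prosodic features (pitch in Hz, energy).
means = np.array([[120.0, 0.2], [180.0, 0.6], [240.0, 0.9]])
var = np.array([400.0, 0.04])                    # shared diagonal variances
log_pi = np.log(np.full(3, 1.0 / 3.0))
log_A = np.log(np.array([[0.90, 0.05, 0.05],
                         [0.05, 0.90, 0.05],
                         [0.05, 0.05, 0.90]]))   # sticky states

# Each state carries a mean 3-D head rotation (roll, pitch, yaw in degrees);
# values are invented for illustration.
head_pose = np.array([[0.0, 0.0, 0.0],
                      [2.0, -5.0, 1.0],
                      [-1.0, 8.0, -3.0]])

# A short synthetic prosody stream: low, mid, then high pitch/energy frames.
prosody = np.array([[125.0, 0.25], [118.0, 0.18], [185.0, 0.55],
                    [190.0, 0.62], [238.0, 0.88], [242.0, 0.91]])

# Diagonal-Gaussian frame log-likelihoods (additive constants dropped).
diff = prosody[:, None, :] - means[None, :, :]
log_B = -0.5 * np.sum(diff ** 2 / var, axis=2)

states = viterbi(log_pi, log_A, log_B)  # e.g. [0, 0, 1, 1, 2, 2]
motion = head_pose[states]              # inferred per-frame head rotations
```

In the actual system the state machine, emission densities, and per-state motion statistics are all learned jointly from tracked head motion and extracted prosody rather than specified by hand, and the decoded output drives the talking-head animation.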
