Speaking with hands: creating animated conversational characters from recordings of human performance

We describe a method for using a database of recorded speech and captured motion to create an animated conversational character. People's utterances are composed of short, clearly delimited phrases; in each phrase, gesture and speech go together meaningfully and synchronize at a common point of maximum emphasis. We develop tools for collecting and managing performance data that exploit this structure. The tools help create scripts for performers, help annotate and segment performance data, and help structure specific messages for characters to use within application contexts. Our animations then reproduce this structure. They recombine motion samples with new speech samples to recreate coherent phrases, and blend segments of speech and motion together phrase by phrase into extended utterances. By framing the problems of utterance generation and synthesis so that they can draw closely on a talented performance, our techniques support the rapid construction of animated characters with rich and appropriate expression.
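To make the phrase-level structure concrete, here is a minimal Python sketch of the scheduling idea the abstract describes: within each phrase, a gesture's stroke is aligned to the speech's point of maximum emphasis, and consecutive phrases are overlapped so their motion can be blended into an extended utterance. The SpeechPhrase and MotionClip types, the blend parameter, and the timing values are hypothetical illustrations, not the paper's actual data structures; the real system selects and recombines units from a performance database rather than working from bare durations.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical phrase-level units. In the paper's system these would be
# segments selected from a database of recorded speech and captured motion.

@dataclass
class SpeechPhrase:
    duration: float   # seconds of audio in this phrase
    emphasis: float   # seconds from phrase start to the prosodic peak

@dataclass
class MotionClip:
    duration: float   # seconds of motion in this clip
    stroke: float     # seconds from clip start to the gesture stroke

def align(speech: SpeechPhrase, motion: MotionClip) -> Tuple[float, float]:
    """Return (speech_delay, motion_delay) so the gesture stroke lands
    exactly on the speech emphasis within one phrase."""
    shift = speech.emphasis - motion.stroke
    # If the stroke comes earlier in the clip than the emphasis does in
    # the audio, delay the motion; otherwise delay the speech.
    return (max(0.0, -shift), max(0.0, shift))

def schedule_utterance(phrases: List[Tuple[SpeechPhrase, MotionClip]],
                       blend: float = 0.2) -> List[Tuple[float, float]]:
    """Lay out an extended utterance phrase by phrase.

    Each phrase keeps its internal emphasis/stroke alignment; consecutive
    phrases overlap by `blend` seconds so motion clips can be cross-faded.
    Returns (speech_start, motion_start) times for each phrase.
    """
    schedule = []
    t = 0.0
    for speech, motion in phrases:
        speech_delay, motion_delay = align(speech, motion)
        speech_start = t + speech_delay
        motion_start = t + motion_delay
        schedule.append((speech_start, motion_start))
        phrase_end = max(speech_start + speech.duration,
                         motion_start + motion.duration)
        t = phrase_end - blend  # overlap the next phrase for blending
    return schedule

# Illustrative usage with made-up timings: in the first phrase the stroke
# (0.9 s into the clip) is aligned to the emphasis (0.6 s into the audio)
# by delaying the speech 0.3 s.
phrases = [
    (SpeechPhrase(duration=1.4, emphasis=0.6), MotionClip(duration=1.8, stroke=0.9)),
    (SpeechPhrase(duration=1.1, emphasis=0.4), MotionClip(duration=1.3, stroke=0.3)),
]
print(schedule_utterance(phrases))
```

The sketch captures only the timing logic; actual unit selection, motion blending, and speech concatenation are the substance of the method itself.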
