Speaking with hands: creating animated conversational characters from recordings of human performance

We describe a method for using a database of recorded speech and captured motion to create an animated conversational character. People's utterances are composed of short, clearly-delimited phrases; in each phrase, gesture and speech go together meaningfully and synchronize at a common point of maximum emphasis. We develop tools for collecting and managing performance data that exploit this structure. The tools help create scripts for performers, help annotate and segment performance data, and structure specific messages for characters to use within application contexts. Our animations then reproduce this structure. They recombine motion samples with new speech samples to recreate coherent phrases, and blend segments of speech and motion together phrase-by-phrase into extended utterances. By framing problems for utterance generation and synthesis so that they can draw closely on a talented performance, our techniques support the rapid construction of animated characters with rich and appropriate expression.
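The core scheduling idea in the abstract — pairing a speech sample with a motion sample per phrase, aligning them at their shared point of maximum emphasis, and laying phrases out back to back — can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation; the `Clip` structure, field names, and functions are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    """A recorded segment (speech or motion). All times in seconds
    from the start of the clip; `emphasis` marks the point of
    maximum emphasis within the clip (hypothetical representation)."""
    duration: float
    emphasis: float

def motion_offset(speech: Clip, motion: Clip) -> float:
    """Offset at which the motion clip should start, relative to the
    speech clip's start, so the two emphasis points coincide."""
    return speech.emphasis - motion.emphasis

def schedule_utterance(phrases):
    """Lay out (speech, motion) phrase pairs into one utterance.
    Returns a (speech_start, motion_start) pair for each phrase;
    each phrase's speech begins when the previous phrase's ends."""
    t = 0.0
    timeline = []
    for speech, motion in phrases:
        timeline.append((t, t + motion_offset(speech, motion)))
        t += speech.duration
    return timeline
```

A real system would additionally blend adjacent motion segments at phrase boundaries rather than simply abutting them, and would handle a motion start landing before time zero (negative offset) by trimming or shifting; those steps are omitted here for brevity.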
