Paired Speech and Gesture Generation in Embodied Conversational Agents

Using face-to-face conversation as an interface metaphor, an embodied conversational agent is likely to be easier to use and learn than traditional graphical user interfaces. To build a believable agent that has, to some extent, the same social and conversational skills as humans, an embodied conversational agent system must be able to handle user input from multiple communication modalities, such as speech and gesture, and generate appropriate behaviors in those same modalities. In this thesis, I address the problem of paired speech and gesture generation in embodied conversational agents. I propose a real-time generation framework capable of producing a comprehensive description of communicative actions, including speech, gesture, and intonation, in the real-estate domain. The generation of speech, gesture, and intonation is based on a shared underlying representation of real-estate properties, discourse information structure, intentional and attentional structures, and a mechanism for updating the common ground between the user and the agent. Algorithms have been implemented to analyze discourse information structure, contrast, and surprising semantic features, which together determine the intonation contour of speech utterances and where gestures occur. I also investigate, through a correlational study, the role of communicative goals in determining the distribution of semantic features across the speech and gesture modalities.

Thesis Advisor: Justine Cassell
Associate Professor of Media Arts and Sciences
AT&T Career Development Professor of Media Arts and Sciences
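The pipeline described above can be illustrated with a minimal sketch: words carrying semantic features are marked as given or new relative to the common ground, new information receives a pitch accent, surprising features trigger a co-occurring gesture, and the common ground is then updated. All function names, the feature representation, and the example sentence are illustrative assumptions, not the thesis's actual implementation.

```python
# Hypothetical sketch of information-structure-driven accent and gesture
# placement. The feature encoding and names are assumptions for illustration.

def plan_utterance(words, common_ground, surprising_features):
    """For each (word, feature) pair, mark the word as 'given' or 'new'
    relative to the common ground, accent new information, and flag a
    gesture site wherever a surprising semantic feature occurs."""
    plan = []
    for word, feature in words:
        # Function words (feature None) and known features count as given.
        status = "given" if feature is None or feature in common_ground else "new"
        plan.append({
            "word": word,
            "status": status,
            "accent": status == "new",                   # pitch accent on new info
            "gesture": feature in surprising_features,   # gesture marks surprise
        })
    # Update the common ground after the utterance is planned.
    common_ground.update(f for _, f in words if f is not None)
    return plan

plan = plan_utterance(
    [("the", None), ("house", "house"), ("has", None), ("a", None),
     ("spiral", "spiral-staircase"), ("staircase", "spiral-staircase")],
    common_ground={"house"},
    surprising_features={"spiral-staircase"},
)
```

In this toy run, "house" is already in the common ground and so stays unaccented, while "spiral staircase" is both new (accented) and surprising (accompanied by a gesture), mirroring how the framework lets a single representation drive both intonation and gesture placement.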
