The Influence of Prosody on the Requirements for Gesture-Text Alignment

Designing an agent capable of multimodal communication requires synchronizing the agent’s performance across its communication channels: text, prosody, gesture, body movement, and facial expression. The synchronization of gesture and spoken text has significant repercussions for agent design. To explore this issue, we examined people’s sensitivity to misalignments between gesture and spoken text, varying both the gesture type and the prosodic emphasis. The study included ratings of individual clips and ratings of paired clips with different alignments. Subjects were unable to detect alignment errors of up to ±0.6 s when shown a single clip. When shown paired clips, however, gestures occurring after the lexical affiliate were rated less positively. There is also evidence that stronger prosodic cues make people more sensitive to misalignment. This suggests that agent designers may be able to “cheat” on maintaining tight synchronization between audio and gesture without a decrease in agent naturalness, though this cheating may not be optimal.
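As a rough illustration only, the following Python sketch (not from the paper) applies the asymmetric tolerance these findings suggest: a planned gesture stroke may lead its lexical affiliate by up to 0.6 s, but should not trail it. The function name, constants, and timing model are assumptions introduced for illustration, not the authors’ method.

```python
# Hypothetical sketch: clamp a planned gesture stroke time into an
# asymmetric tolerance window around its lexical affiliate, reflecting
# the finding that early strokes are tolerated (up to ~0.6 s) while
# strokes occurring after the affiliate are rated less positively.

MAX_LEAD_S = 0.6  # assumed: stroke may precede the affiliate by up to 0.6 s
MAX_LAG_S = 0.0   # assumed: stroke should not trail the affiliate

def align_stroke(stroke_time_s: float, affiliate_time_s: float) -> float:
    """Return a stroke time adjusted to fall inside the tolerance window."""
    earliest = affiliate_time_s - MAX_LEAD_S
    latest = affiliate_time_s + MAX_LAG_S
    return min(max(stroke_time_s, earliest), latest)

if __name__ == "__main__":
    # A stroke planned 0.3 s late is pulled back onto the affiliate;
    # one planned 0.4 s early is left alone.
    print(align_stroke(2.3, 2.0))  # -> 2.0
    print(align_stroke(1.6, 2.0))  # -> 1.6
```

A scheduler built on this heuristic could relax frame-accurate audio-gesture locking while still avoiding the lagging alignments that subjects penalized.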
