Gesture synthesis adapted to speech emphasis

Avatars communicate through speech and gestures to appear realistic and to enhance interaction with humans. In this context, several works have analyzed the relationship between speech and gestures, while others have focused on synthesizing them, following different approaches. In this work, we address both goals by linking speech to gestures in terms of timing and intensity, and then use this knowledge to drive a gesture synthesizer from a manually annotated speech signal. To that end, we define strength indicators for speech and motion. After validating them through perceptual tests, we obtain an intensity rule from their correlation. We also derive a synchrony rule that determines temporal correspondences between speech and gestures. These analyses were conducted on aggressive and neutral performances, whose speech signal and motion were manually annotated, to cover a broad range of emphatic levels. The intensity and synchrony rules are then used to drive a gesture synthesizer called the gesture motion graph (GMG). The rules are validated through perceptual tests in which users rate GMG output animations. Results show that animations using both the intensity and synchrony rules perform better than those using only the synchrony rule, which in turn improve realism with respect to random animations. We conclude that the extracted rules allow the GMG to properly synthesize gestures adapted to speech emphasis from annotated speech.
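To make the pipeline concrete, the following is a minimal sketch, not the authors' implementation, of how an intensity rule and a synchrony rule could turn an annotated speech signal into gesture cues for a synthesizer such as a gesture motion graph. The data classes, the strength formulation based on pitch and intensity excursions, the linear intensity rule, and the fixed stroke lead time are all illustrative assumptions; the paper derives its actual rules from perceptual tests and measured speech-motion correlations.

```python
# Hypothetical sketch of driving gesture synthesis from annotated speech emphasis.
# Names, thresholds, and the linear/fixed-offset rule forms are assumptions for
# illustration only; they are not the rules extracted in the paper.

from dataclasses import dataclass
from typing import List


@dataclass
class EmphasizedWord:
    """A manually annotated emphatic word in the speech signal (assumed format)."""
    text: str
    accent_time: float        # time of the pitch-accent peak, in seconds
    pitch_peak_hz: float      # F0 at the accent
    intensity_peak_db: float  # energy at the accent


@dataclass
class GestureCue:
    """A gesture request handed to the synthesizer (e.g., a gesture motion graph)."""
    word: str
    stroke_time: float       # when the gesture stroke should occur
    motion_strength: float   # target strength indicator for the selected gesture


def speech_strength(w: EmphasizedWord,
                    base_pitch_hz: float = 120.0,
                    base_intensity_db: float = 60.0) -> float:
    """Toy speech strength indicator: normalized excursion of pitch and energy
    above the speaker's baseline (assumed formulation)."""
    pitch_term = max(0.0, (w.pitch_peak_hz - base_pitch_hz) / base_pitch_hz)
    energy_term = max(0.0, (w.intensity_peak_db - base_intensity_db) / base_intensity_db)
    return 0.5 * pitch_term + 0.5 * energy_term


def intensity_rule(speech_str: float, gain: float = 1.2) -> float:
    """Map speech strength to a target motion strength (assumed linear rule)."""
    return gain * speech_str


def synchrony_rule(accent_time: float, stroke_lead_s: float = 0.15) -> float:
    """Place the gesture stroke slightly before the pitch accent
    (the lead value is illustrative, not the paper's measured offset)."""
    return max(0.0, accent_time - stroke_lead_s)


def cues_from_annotation(words: List[EmphasizedWord]) -> List[GestureCue]:
    """Turn annotated emphatic words into gesture cues for the synthesizer."""
    return [
        GestureCue(
            word=w.text,
            stroke_time=synchrony_rule(w.accent_time),
            motion_strength=intensity_rule(speech_strength(w)),
        )
        for w in words
    ]


if __name__ == "__main__":
    annotation = [
        EmphasizedWord("never", accent_time=1.20, pitch_peak_hz=210.0, intensity_peak_db=74.0),
        EmphasizedWord("stop", accent_time=2.05, pitch_peak_hz=165.0, intensity_peak_db=68.0),
    ]
    for cue in cues_from_annotation(annotation):
        print(cue)
```

In this sketch, stronger pitch and energy excursions yield larger target motion strengths, and each gesture stroke is scheduled relative to its pitch accent; a synthesizer would then search its motion graph for gesture clips whose strength indicators best match the requested values.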
