Learning Individual Styles of Conversational Gesture

Human speech is often accompanied by hand and arm gestures. We present a method for cross-modal translation from "in-the-wild" monologue speech of a single speaker to their conversational gesture motion. We train on unlabeled videos for which we only have noisy pseudo ground truth from an automatic pose detection system. Our proposed model significantly outperforms baseline methods in a quantitative comparison. To support research toward obtaining a computational understanding of the relationship between gesture and speech, we release a large video dataset of person-specific gestures.

[1]  S. Chiba,et al.  Dynamic programming algorithm optimization for spoken word recognition , 1978 .

[2]  B. Butterworth,et al.  Gesture, speech, and computational stages: a reply to McNeill. , 1989, Psychological review.

[3]  Mark Steedman,et al.  Animated conversation: rule-based generation of facial expression, gesture & spoken intonation for multiple conversational agents , 1994, SIGGRAPH.

[4]  C. Creider Hand and Mind: What Gestures Reveal about Thought , 1994 .

[5]  William T. Freeman,et al.  Orientation Histograms for Hand Gesture Recognition , 1995 .

[6]  Alex Pentland,et al.  Task-Specific Gesture Analysis in Real-Time Using Interpolated Views , 1996, IEEE Trans. Pattern Anal. Mach. Intell..

[7]  Christoph Bregler,et al.  Video Rewrite: Driving Visual Speech with Audio , 1997, SIGGRAPH.

[8]  D. McNeill,et al.  Speech-gesture mismatches: Evidence for one underlying representation of linguistic and nonlinguistic information , 1998 .

[9]  J. Cassell,et al.  Embodied conversational agents , 2000 .

[10]  Rashid Ansari,et al.  Multimodal human discourse: gesture and speech , 2002, TCHI.

[11]  Justine Cassell,et al.  BEAT: the Behavior Expression Animation Toolkit , 2001, Life-like characters.

[12]  A. Kendon Gesture: Visible Action as Utterance , 2004 .

[13]  Stefan Kopp,et al.  Towards a Common Framework for Multimodal Generation: The Behavior Markup Language , 2006, IVA.

[14]  Michael Neff,et al.  Towards Natural Gesture Synthesis: Evaluating Gesture Units in a Data-Driven Approach to Gesture Synthesis , 2007, IVA.

[15]  Trevor Darrell,et al.  Latent-Dynamic Discriminative Models for Continuous Gesture Recognition , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Hans-Peter Seidel,et al.  Annotated New Text Engine Animation Animation Lexicon Animation Gesture Profiles MR : . . . JL : . . . Gesture Generation Video Annotated Gesture Script , 2007 .

[17]  Stacy Marsella,et al.  SmartBody: behavior realization for embodied conversational agents , 2008, AAMAS.

[18]  Sergey Levine,et al.  Real-time prosody-driven synthesis of body language , 2009, SIGGRAPH 2009.

[19]  Andrew Zisserman,et al.  Learning sign language by watching TV (using weakly aligned subtitles) , 2009, CVPR.

[20]  Susan Goldin-Meadow,et al.  Using the Hands to Identify Who Does What to Whom: Gesture and Speech Go Hand-in-Hand , 2009, Cogn. Sci..

[21]  Sergey Levine,et al.  Gesture controllers , 2010, SIGGRAPH 2010.

[22]  Stacy Marsella,et al.  How to Train Your Avatar: A Data Driven Approach to Gesture Generation , 2011, IVA.

[23]  Jan Peter de Ruiter,et al.  The Interplay Between Gesture and Speech in the Production of Referring Expressions: Investigating the Tradeoff Hypothesis , 2012, Top. Cogn. Sci..

[24]  Yi Yang,et al.  Articulated Human Detection with Flexible Mixtures of Parts , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[25]  Anton Leuski,et al.  All Together Now - Introducing the Virtual Human Toolkit , 2013, IVA.

[26]  Yuyu Xu,et al.  Virtual character performance from speech , 2013, SCA '13.

[27]  Guest Editorial Gesture and speech in interaction : An overview , 2013 .

[28]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[29]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[30]  Doug L. James,et al.  Inverse-Foley animation , 2014, ACM Trans. Graph..

[31]  Andrew Zisserman,et al.  Deep Convolutional Neural Networks for Efficient Pose Estimation in Gesture Videos , 2014, ACCV.

[32]  Carlos Busso,et al.  Retrieving Target Gestures Toward Speech Driven Animation with Meaningful Behaviors , 2015, ICMI.

[33]  Thomas Brox,et al.  U-Net: Convolutional Networks for Biomedical Image Segmentation , 2015, MICCAI.

[34]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[35]  Hermann Ney,et al.  Deep Hand: How to Train a CNN on 1 Million Hand Images When Your Data is Continuous and Weakly Labelled , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Sarajane Marques Peres,et al.  Gesture phase segmentation using support vector machines , 2016, Expert Syst. Appl..

[37]  Antonio Torralba,et al.  SoundNet: Learning Sound Representations from Unlabeled Video , 2016, NIPS.

[38]  Aren Jansen,et al.  CNN architectures for large-scale audio classification , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[39]  Aren Jansen,et al.  Audio Set: An ontology and human-labeled dataset for audio events , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[40]  Ira Kemelmacher-Shlizerman,et al.  Synthesizing Obama , 2017, ACM Trans. Graph..

[41]  Alexei A. Efros,et al.  Image-to-Image Translation with Conditional Adversarial Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[43]  Joon Son Chung,et al.  You said that? , 2017, BMVC.

[44]  Hermann Ney,et al.  Neural Sign Language Translation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[45]  Jitendra Malik,et al.  From Lifestyle Vlogs to Everyday Interactions , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[46]  Ira Kemelmacher-Shlizerman,et al.  Audio to Body Dynamics , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[47]  Alexei A. Efros,et al.  Everybody Dance Now , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).