Audio to Body Dynamics

We present a method that takes as input audio of violin or piano playing and outputs a video of predicted skeleton keypoints, which are then used to animate an avatar. The key idea is to create, from audio alone, an animation of an avatar whose hands move the way a pianist's or violinist's would. Notably, it is not obvious that body movement can be predicted from music at all, and our aim in this work is to explore that possibility. In this paper we present the first result showing that natural body dynamics can be predicted from music. We build an LSTM network trained on violin and piano recital videos uploaded to the Internet, and apply the predicted keypoints to a rigged avatar to create the animation.
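The pipeline described above (per-frame audio features fed to an LSTM that regresses body keypoint positions, one prediction per frame) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the feature dimension, hidden size, and keypoint count are assumptions, and a real system would use learned weights rather than random ones.

```python
import numpy as np

rng = np.random.default_rng(0)

N_MFCC = 13       # MFCC coefficients per audio frame (assumed)
HIDDEN = 64       # LSTM hidden size (illustrative)
N_KEYPOINTS = 15  # upper-body/hand keypoints (illustrative)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyLSTM:
    """Minimal single-layer LSTM mapping an audio feature sequence
    to a sequence of flattened 2D keypoint coordinates."""

    def __init__(self, n_in, n_hidden, n_out):
        s = 0.1
        # Gate weights for input, forget, output, and cell candidate, stacked.
        self.W = rng.normal(0, s, (4 * n_hidden, n_in + n_hidden))
        self.b = np.zeros(4 * n_hidden)
        # Linear readout from hidden state to keypoint coordinates.
        self.Wo = rng.normal(0, s, (n_out, n_hidden))
        self.n_hidden = n_hidden

    def forward(self, x_seq):
        H = self.n_hidden
        h = np.zeros(H)
        c = np.zeros(H)
        out = []
        for x in x_seq:
            z = self.W @ np.concatenate([x, h]) + self.b
            i, f, o = sigmoid(z[:H]), sigmoid(z[H:2*H]), sigmoid(z[2*H:3*H])
            g = np.tanh(z[3*H:])
            c = f * c + i * g           # update cell state
            h = o * np.tanh(c)          # update hidden state
            out.append(self.Wo @ h)     # (x, y) per keypoint, flattened
        return np.stack(out)

# 100 audio frames in -> 100 keypoint predictions out, which a
# downstream step would apply to a rigged avatar.
audio_features = rng.normal(size=(100, N_MFCC))
model = TinyLSTM(N_MFCC, HIDDEN, 2 * N_KEYPOINTS)
keypoints = model.forward(audio_features)
print(keypoints.shape)  # (100, 30)
```

The one-prediction-per-frame structure keeps the animation synchronized with the music; in practice the recurrence lets each prediction depend on the audio context accumulated over previous frames, not just the current one.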
