An audio-driven dancing avatar

We present a framework for the training and synthesis of an audio-driven dancing avatar. The avatar is trained for a given musical genre using multicamera video recordings of a dance performance. The video is analyzed to capture the time-varying posture of the dancer's body, whereas the musical audio signal is processed to extract beat information. We consider two marker-based schemes for the motion capture problem: the first represents body motion with 3D joint positions, whereas the second uses joint angles. The dancer's body movements are characterized by a set of recurring semantic motion patterns, i.e., dance figures. Each dance figure is modeled in a supervised manner with a set of hidden Markov model (HMM) structures and an associated beat frequency. In the synthesis phase, an audio signal of unknown musical type is first classified, within a time interval, into one of the genres learned in the analysis phase, based on mel-frequency cepstral coefficients (MFCCs). The motion parameters of the corresponding dance figures are then synthesized via the trained HMM structures in synchrony with the audio signal, based on the estimated tempo information. Finally, the generated motion parameters, either the joint angles or the 3D joint positions of the body, are animated along with the musical audio using two animation tools that we have developed. Experimental results demonstrate the effectiveness of the proposed framework.
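To make the pipeline concrete, the following is a minimal sketch of the synthesis side, not the authors' implementation. It assumes librosa for MFCC and tempo extraction and hmmlearn's GaussianHMM as a stand-in for the paper's HMM structures; the genre classifier is reduced to a hypothetical nearest-centroid rule over mean MFCC vectors, and the function names (classify_genre, train_figure_hmm, synthesize_figure) and parameters such as beats_per_figure are our own illustrative choices.

```python
# Illustrative sketch only: librosa and hmmlearn stand in for the paper's
# feature-extraction and HMM machinery; the nearest-centroid genre
# classifier is a simplified, hypothetical placeholder.
import numpy as np
import librosa
from hmmlearn.hmm import GaussianHMM


def classify_genre(audio, sr, genre_centroids, n_mfcc=13):
    """Assign the clip to the genre whose mean MFCC vector is closest.

    genre_centroids: dict mapping genre name -> (n_mfcc,) mean MFCC vector,
    assumed to have been learned in the analysis phase.
    """
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, T)
    clip_vec = mfcc.mean(axis=1)
    return min(genre_centroids,
               key=lambda g: np.linalg.norm(clip_vec - genre_centroids[g]))


def estimate_tempo(audio, sr):
    """Estimate the tempo (beats per minute) of the driving audio."""
    tempo, _beats = librosa.beat.beat_track(y=audio, sr=sr)
    return float(np.asarray(tempo).reshape(-1)[0])


def train_figure_hmm(motion_sequences, n_states=8):
    """Fit one Gaussian HMM to example realizations of a single dance figure.

    motion_sequences: list of (T_i, D) arrays of motion parameters
    (joint angles or 3D joint positions) captured for that figure.
    """
    X = np.concatenate(motion_sequences, axis=0)
    lengths = [len(s) for s in motion_sequences]
    model = GaussianHMM(n_components=n_states, covariance_type="diag",
                        n_iter=50, random_state=0)
    model.fit(X, lengths)
    return model


def synthesize_figure(model, beats_per_figure, tempo_bpm, fps=30):
    """Sample one figure's motion-parameter trajectory, sized to the audio tempo.

    beats_per_figure: how many musical beats one repetition of the figure
    spans (an assumed, precomputed property of the trained figure).
    """
    duration_s = beats_per_figure * 60.0 / tempo_bpm      # figure length at this tempo
    n_frames = max(1, int(round(duration_s * fps)))
    motion, _states = model.sample(n_frames)              # (n_frames, D) trajectory
    return motion
```

In this sketch, the analysis phase would supply the per-genre MFCC centroids and one trained HMM per dance figure; at synthesis time the input audio is classified, its tempo estimated, and figure trajectories are sampled and concatenated at the beat rate before being handed to an animation tool. The actual paper's classifier, HMM topology, and beat-synchronization scheme may differ.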
