D. Cosker (1), C. Holt (2), D. Mason (2), G. Whatling (2), D. Marshall (3) and P. L. Rosin (3)

(1) Media Technology Research Centre, University of Bath. D.P.Cosker@cs.bath.ac.uk
(2) School of Engineering, Cardiff University. {Holt, MasonD, Whatlinggm}@cardiff.ac.uk
(3) School of Computer Science, Cardiff University. {Dave.Marshall, Paul.Rosin}@cs.cf.ac.uk

Keywords: Non-Verbal, HMM, Animation, Motion-Capture

1 Introduction

While speech-driven animation for lip-synching and facial expression synthesis from speech has previously received much attention [1, 2], there is little or no previous work on generating non-verbal actions such as laughing and crying automatically from an audio signal. In this article, initial results from a system designed to address this issue are presented.

2 System Overview

Figure 1 gives an overview of our current system. 3D facial data was recorded for a participant performing different actions (i.e. laughing, crying, yawning and sneezing) using a Qualisys (Sweden) optical motion-capture system while simultaneously recording audio data. 30 retro-reflective markers were placed on the participant's face to capture movement. Using this data, an analysis and synthesis machine was then trained, consisting of a dual-input Hidden Markov Model (HMM) and a trellis search algorithm which converts HMM visual states and new input audio into new 3D motion-capture data.

Figure 1: New motion-capture animations are created automatically from new audio recordings. This data may then drive a more detailed 3D facial model.

3 Analysis and Synthesis

After normalising the data with respect to head pose variation, its dimensionality is reduced using PCA. Audio is represented using Mel Frequency Cepstral Coefficients (MFCCs). An HMM is trained on the visual features, and a dual mapping is created from each HMM state to the audio features. This allows a new HMM visual state sequence to be created given novel input audio. Given a new visual HMM state sequence, a 3D facial motion-capture output is created at each state (and each time t) by finding the visual feature with the best-matching corresponding audio feature. In this process, features which are dissimilar to the feature selected at time t-1 are also penalised.

4 Results and Discussion

A participant was recorded making approximately ten repetitions of each action, i.e. laughing, crying, yawning and sneezing. A synthesis machine was then trained using combined observation data from all actions. Using a leave-one-observation-out strategy, we tested the model's ability to resynthesise actions solely from previously unobserved audio.

Synthesised motion-capture animation results show an excellent correlation with new audio data. This is further supported by the low RMS errors (in millimetres) calculated by comparing synthetic animations with their ground truth (see Table 1). In order to find a stable number of HMM states, and to test repeatability in light of the random initialisation of HMMs [1], the experimental set-up was repeated multiple times for various state numbers. A one-way ANOVA showed that repeated trials with 30 or more HMM states gave consistently strong results with low RMS errors (significant at p < 0.05).
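The training stage described in Section 3 can be summarised in code. The following minimal sketch is not the authors' implementation: it assumes frame-aligned marker and MFCC arrays and uses scikit-learn and hmmlearn as stand-ins for the PCA and HMM components. It reduces the pose-normalised marker data with PCA, fits an HMM on the visual features, and records which training frames fall into each state, so that the audio features paired with those frames form the dual state-to-audio mapping.

import numpy as np
from sklearn.decomposition import PCA
from hmmlearn.hmm import GaussianHMM

def train_dual_hmm(markers, mfcc, n_states=30, n_pca=10):
    """markers: (T, 90) flattened 30 x 3D marker positions, pose-normalised.
    mfcc:    (T, 13) audio features aligned frame-by-frame with the markers."""
    pca = PCA(n_components=n_pca)
    visual = pca.fit_transform(markers)          # low-dimensional visual features

    hmm = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
    hmm.fit(visual)                              # HMM trained on visual features only
    states = hmm.predict(visual)                 # per-frame state assignment

    # Dual mapping: for every visual state, keep the indices of the training
    # frames assigned to it, so their paired audio (MFCC) and visual features
    # can be looked up at synthesis time.
    state_to_frames = {s: np.where(states == s)[0] for s in range(n_states)}
    return pca, hmm, state_to_frames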
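The synthesis stage can be sketched in the same spirit. Given a visual HMM state sequence decoded from new audio (that decoding step is omitted here), each output frame is chosen from the training frames mapped to the current state by matching MFCC features, with a penalty on candidates whose visual features differ from the previous selection. A greedy per-frame pass stands in for the full trellis search, and the function and parameter names (e.g. smooth_weight) are illustrative assumptions.

import numpy as np

def synthesise(state_seq, new_mfcc, train_mfcc, train_markers,
               state_to_frames, smooth_weight=0.5):
    out, prev = [], None
    for t, s in enumerate(state_seq):
        cand = state_to_frames[s]                              # candidate training frames
        audio_cost = np.linalg.norm(train_mfcc[cand] - new_mfcc[t], axis=1)
        if prev is None:
            cost = audio_cost
        else:
            # Penalise visual features dissimilar to the frame chosen at t-1.
            cont_cost = np.linalg.norm(train_markers[cand] - prev, axis=1)
            cost = audio_cost + smooth_weight * cont_cost
        best = cand[np.argmin(cost)]
        prev = train_markers[best]
        out.append(prev)
    return np.stack(out)                                       # (T, 90) synthetic mocap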