3-D Face Point Trajectory Synthesis Using An Automatically Derived Visual Phoneme Similarity Matrix

Levent M. Arslan and David Talkin
Entropic Inc., Washington, DC, 20003

ABSTRACT

This paper presents a novel algorithm which generates three-dimensional face point trajectories for a given speech file with or without its text. The proposed algorithm first employs an off-line training phase. In this phase, recorded face point trajectories along with their speech data and phonetic labels are used to generate codebooks. These codebooks consist of both acoustic and visual features. Acoustics are represented by line spectral frequencies (LSF), and face points are represented with their principal components (PC). During the synthesis stage, speech input is rated in terms of its similarity to the codebook entries. Based on the similarity, each codebook entry is assigned a weighting coefficient. If phonetic information about the test speech is available, it is utilized in restricting the codebook search to only the several entries which are visually closest to the current phoneme (a visual similarity matrix is generated for this purpose). These weights are then used to synthesize the principal components of the face point trajectory. The performance of the algorithm is tested on held-out data, and the synthesized face point trajectories showed a correlation of 0.73 with the true face point trajectories.

1. INTRODUCTION

Recently there has been significant interest in the area of face synthesis. This topic has numerous applications including film dubbing, computer-based language instruction, cartoon character animation, multimedia entertainment, etc. There is a large effort in developing autonomous software agents that can communicate with humans using speech, facial expression, gestures, and intonation. Katashi and Akikazu (1994) employed animated facial expressions in a spoken dialogue system. Other researchers (Cassel et al. 1994, Bertenstam 1995, Beskow 1997) used various forms of visual agents that featured animated gestures, intonation, and head movements. Lip synching is another application of wide interest. The Video Rewrite system (Bregler et al. 1997) uses existing footage to automatically create new video of a person mouthing words that she did not speak in the original footage.

In this study, we propose a new algorithm to synthesize three-dimensional face point trajectories corresponding to a novel utterance. The general algorithm does not require any text input. However, the performance of the algorithm improves significantly if phonetic information is known a priori. Therefore, throughout this paper the algorithm will be described assuming phonetic information is available. At the end we will describe how the proposed algorithm behaves in the case where phonetic information is not available. The most significant contribution of this paper is that it addresses the problem of generating audiovisual speech from the acoustic signal alone. Therefore, it is possible to add the proposed system to acoustics-only synthesizers to complement speech with visual information.

The general outline of the paper is as follows. Section 2 describes the proposed face point trajectory synthesis algorithm. In this section, the formulation and automatic generation of a novel visual phoneme similarity matrix is described as well. Section 3 presents the simulations and performance evaluation. Finally, Section 4 discusses the results and future directions.

2. ALGORITHM DESCRIPTION

The face synthesis algorithm proposed in this paper is an extension of the STASC voice transformation algorithm described in Arslan and Talkin (1997). The STASC algorithm modifies the utterance of a source speaker to sound like speech from a target speaker. The acoustic parameters (LSF) are transformed to target speaker acoustic parameters by employing a weighted codebook mapping approach. The generation of audio codebooks in this paper follows the same methodology as the STASC algorithm. However, instead of mapping the acoustic parameters to a target speaker's acoustic space, the proposed algorithm maps the incoming speech into the source speaker's own acoustic space, and augments it with visual data.

The flowchart of the proposed face synthesis algorithm is shown in Figure 1. The algorithm requires two on-line inputs: i) a digitized speech file; and ii) its corresponding phoneme sequence. It also requires two additional inputs which are generated prior to face synthesis during the training stage: i) an audio-visual codebook; and ii) a visual phoneme similarity matrix. First, we will explain how the codebook and the visual phoneme similarity matrix are generated.

2.1. Audio-Visual Codebook Generation

For the data collection, synchronized speech and face point trajectories must first be recorded from a subject. For this study, these were recorded using a multi-camera triangulation system yielding 60 samples/sec at a spatial resolution of .254 mm in X, Y, and Z. In the pilot study reported here, 54 points on and around the face
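To make the weighted codebook mapping concrete, the sketch below rates an input acoustic frame against codebook LSF centroids and synthesizes the face-point principal components as a similarity-weighted sum of the visual entries. The toy codebook values, the `gamma` sharpness parameter, and the exponential distance-to-weight conversion are illustrative assumptions; this excerpt specifies the weighted codebook idea but not an exact weight formula.

```python
import math

# Toy audio-visual codebook: each entry pairs an LSF centroid (acoustic
# feature) with a face-point principal-component vector (visual feature).
# All values are invented for illustration.
codebook = [
    {"lsf": [0.2, 0.5, 0.9], "pc": [1.0, 0.0]},   # hypothetical entry 1
    {"lsf": [0.3, 0.6, 1.1], "pc": [0.4, 0.8]},   # hypothetical entry 2
    {"lsf": [0.1, 0.4, 0.7], "pc": [-0.5, 0.3]},  # hypothetical entry 3
]

def synthesize_pc(input_lsf, entries, gamma=10.0):
    """Weight each codebook entry by its acoustic similarity to the input
    frame, then return the weighted sum of the visual PC vectors."""
    # Acoustic distance of the input frame to every codebook centroid.
    dists = [math.dist(input_lsf, e["lsf"]) for e in entries]
    # Convert distances to normalized similarity weights (assumed form).
    raw = [math.exp(-gamma * d) for d in dists]
    total = sum(raw)
    weights = [r / total for r in raw]
    # Weighted sum of the visual principal components.
    n_pc = len(entries[0]["pc"])
    return [sum(w * e["pc"][k] for w, e in zip(weights, entries))
            for k in range(n_pc)]

# A frame acoustically close to the first entry yields a PC vector
# dominated by that entry's visual features.
frame_pc = synthesize_pc([0.21, 0.52, 0.92], codebook)
```

With a large `gamma` the mapping approaches nearest-neighbour selection, while a small `gamma` blends many entries; the blended output is what allows smooth trajectories across frames.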
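One plausible way to realize the visual phoneme similarity matrix that restricts the codebook search is sketched below: average the face-point PC vectors observed for each phoneme in training, define pairwise similarity from the distance between those averages, and keep only the few phonemes visually closest to the current one during synthesis. The phoneme set, PC values, and the `1/(1+d)` similarity form are all assumptions for illustration; the paper's actual automatic derivation is not shown in this excerpt.

```python
import math

# Hypothetical mean face-point PC vector per phoneme, as would be
# estimated from the labeled training trajectories.
phoneme_pc = {
    "aa": [1.0, 0.0],
    "ao": [0.9, 0.1],
    "iy": [0.2, 0.9],
    "m":  [-0.6, 0.2],
    "b":  [-0.55, 0.25],
}

def visual_similarity_matrix(means):
    """Pairwise visual similarity between phonemes: smaller distance
    between mean PC vectors gives higher similarity (assumed 1/(1+d))."""
    return {
        p: {q: 1.0 / (1.0 + math.dist(v, w)) for q, w in means.items()}
        for p, v in means.items()
    }

def visually_closest(matrix, phoneme, k=2):
    """Phonemes allowed in the restricted codebook search: the k entries
    most visually similar to the current phoneme."""
    ranked = sorted(matrix[phoneme], key=lambda q: -matrix[phoneme][q])
    return ranked[:k]

sim = visual_similarity_matrix(phoneme_pc)
```

Under these toy values, the bilabials "m" and "b" end up mutual nearest visual neighbours, which matches the intuition that phonemes with similar mouth shapes should share codebook entries.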

[1] Kuldip K. Paliwal, et al. Speech Coding and Synthesis, 1995.

[2] Rajiv Laroia, et al. Robust and efficient quantization of speech LSP parameters using structured vector quantizers, 1991, [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing.

[3] Levent M. Arslan, et al. Voice conversion by codebook mapping of line spectral frequencies and excitation spectrum, 1997, EUROSPEECH.

[4] Keinosuke Fukunaga, et al. Statistical Pattern Recognition, 1993, Handbook of Pattern Recognition and Computer Vision.

[5] Matthew Stone, et al. Modeling the Interaction between Speech and Gesture, 1994.

[6] Alan McCree, et al. New methods for adaptive noise suppression, 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[7] Jonas Beskow, et al. Animation of talking agents, 1997, AVSP.

[8] Katashi Nagao, Akikazu Takeuchi. Speech dialogue with facial displays, 1994, CHI '94.

[9] Hani Yehia, et al. Quantitative association of orofacial and vocal-tract shapes, 1997, AVSP.

[10] Lynne E. Bernstein, et al. Effects of phonetic variation and the structure of the lexicon on the uniqueness of words, 1997, AVSP.

[11] Christoph Bregler, et al. Video rewrite: visual speech synthesis from video, 1997, AVSP.