3-D FACE POINT TRAJECTORY SYNTHESIS USING AN AUTOMATICALLY DERIVED VISUAL PHONEME SIMILARITY MATRIX

Levent M. Arslan and David Talkin
Entropic Inc., Washington, DC, 20003

ABSTRACT

This paper presents a novel algorithm which generates three-dimensional face point trajectories for a given speech file, with or without its text. The proposed algorithm first employs an off-line training phase. In this phase, recorded face point trajectories, along with their speech data and phonetic labels, are used to generate codebooks. These codebooks consist of both acoustic and visual features. Acoustics are represented by line spectral frequencies (LSF), and face points are represented by their principal components (PC). During the synthesis stage, the speech input is rated in terms of its similarity to the codebook entries. Based on this similarity, each codebook entry is assigned a weighting coefficient. If phonetic information about the test speech is available, it is used to restrict the codebook search to only the several entries which are visually closest to the current phoneme (a visual similarity matrix is generated for this purpose). These weights are then used to synthesize the principal components of the face point trajectory. The performance of the algorithm is tested on held-out data, and the synthesized face point trajectories showed a correlation of 0.73 with the true face point trajectories.

1. INTRODUCTION

Recently there has been significant interest in the area of face synthesis. This topic has numerous applications, including film dubbing, computer-based language instruction, cartoon character animation, and multimedia entertainment. There is a large effort in developing autonomous software agents that can communicate with humans using speech, facial expression, gestures, and intonation. Katashi Nagao and Akikazu Takeuchi (1994) employed animated facial expressions in a spoken dialogue system. Other researchers (Cassell et al., 1994; Bertenstam et al., 1995; Beskow et al., 1997) used various forms of visual agents that featured animated gestures, intonation, and head movements. Lip-synching is another application of wide interest. The Video Rewrite system (Bregler et al., 1997) uses existing footage to automatically create new video of a person mouthing words that she did not speak in the original footage.

In this study, we propose a new algorithm to synthesize three-dimensional face point trajectories corresponding to a novel utterance. The general algorithm does not require any text input. However, the performance of the algorithm improves significantly if phonetic information is known a priori. Therefore, throughout this paper the algorithm will be described assuming phonetic information is available. At the end, we will describe how the proposed algorithm behaves in the case where phonetic information is not available. The most significant contribution of this paper is that it addresses the problem of generating audiovisual speech from the acoustic signal alone. Therefore, it is possible to attach the proposed system to acoustics-only synthesizers to complement speech with visual information.

The general outline of the paper is as follows. Section 2 describes the proposed face point trajectory synthesis algorithm. In this section, the formulation and automatic generation of a novel visual phoneme similarity matrix is described as well. Section 3 presents the simulations and performance evaluation. Finally, Section 4 discusses the results and future directions.

2. ALGORITHM DESCRIPTION

The face synthesis algorithm proposed in this paper is an extension of the STASC voice transformation algorithm, which is described in Arslan and Talkin (1997). The STASC algorithm modifies the utterance of a source speaker to sound like speech from a target speaker. The acoustic parameters (LSF) are transformed to the target speaker's acoustic parameters by employing a weighted codebook mapping approach. The generation of audio codebooks in this paper follows the same methodology as in the STASC algorithm. However, instead of mapping the acoustic parameters to a target speaker's acoustic space, the proposed algorithm maps the incoming speech into the source speaker's own acoustic space and augments it with visual data.

The flowchart of the proposed face synthesis algorithm is shown in Figure 1. The algorithm requires two on-line inputs: i) a digitized speech file; and ii) its corresponding phoneme sequence. It also requires two additional inputs which are generated prior to face synthesis during the training stage: i) an audio-visual codebook; and ii) a visual phoneme similarity matrix. First, we will explain how the codebook and the visual phoneme similarity matrix are generated.

2.1. Audio-Visual Codebook Generation

For the data collection, synchronized speech and face point trajectories must first be recorded from a subject. For this study, they were recorded using a multi-camera triangulation system yielding 60 samples/sec at a spatial resolution of .254 mm in X, Y, and Z. In the pilot study reported here, 54 points on and around the face
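As Section 2.1 notes, the recorded face points are represented by their principal components. The following is an illustrative sketch of that representation, not the authors' implementation: it assumes each frame stacks the 54 tracked 3-D points into a 162-dimensional vector and uses random numbers as a stand-in for recorded data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for recorded data: 600 frames (10 s at 60 samples/sec),
# each a 54-point x 3-coordinate face shape flattened to 162 dimensions.
frames = rng.normal(size=(600, 54 * 3))

# Center the frames and compute principal components via SVD of the
# frame-by-dimension matrix (equivalent to eigenanalysis of the covariance).
mean_shape = frames.mean(axis=0)
centered = frames - mean_shape
_, _, components = np.linalg.svd(centered, full_matrices=False)

# Keep the leading k components; each frame is then summarized by k PC
# coefficients instead of 162 raw coordinates, and the PC coefficient
# sequences over time are the "face point trajectories" the paper synthesizes.
k = 8
pcs = centered @ components[:k].T                   # (600, k) PC trajectories
reconstructed = mean_shape + pcs @ components[:k]   # approximate face points
```

Projecting onto a small number of components both compresses the visual data and smooths out measurement noise; the choice k = 8 here is an arbitrary assumption for the sketch.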
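The synthesis stage described above (acoustic-similarity weights applied to visual codewords, optionally restricted to entries visually close to the current phoneme) can be sketched as follows. The codebook contents, the distance measure, and the exponential weight normalization are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical audio-visual codebook from the training stage: L entries, each
# pairing an acoustic centroid (18 LSFs) with a visual centroid (k face PCs).
L, n_lsf, k = 64, 18, 8
acoustic_codebook = rng.normal(size=(L, n_lsf))
visual_codebook = rng.normal(size=(L, k))

def synthesize_visual_pcs(lsf_frame, gamma=5.0, candidates=None):
    """Map one acoustic frame to visual PCs by weighted codebook mapping.

    Each codebook entry receives a weight that decays with its acoustic
    distance to the input frame; the output is the weighted average of the
    visual codewords. If a candidate index set is given (e.g. the entries
    visually closest to the current phoneme, per the visual phoneme
    similarity matrix), the search is restricted to those entries.
    """
    idx = np.arange(L) if candidates is None else np.asarray(candidates)
    dists = np.linalg.norm(acoustic_codebook[idx] - lsf_frame, axis=1)
    weights = np.exp(-gamma * dists)
    weights /= weights.sum()          # weighting coefficients sum to one
    return weights @ visual_codebook[idx]

pcs_frame = synthesize_visual_pcs(rng.normal(size=n_lsf))
pcs_restricted = synthesize_visual_pcs(rng.normal(size=n_lsf),
                                       candidates=[3, 9, 17])
```

Running this per analysis frame yields a PC trajectory, which the stored principal-component basis then maps back to 3-D face point coordinates.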
REFERENCES

[1] Kuldip K. Paliwal et al., Speech Coding and Synthesis, 1995.
[2] Rajiv Laroia et al., "Robust and efficient quantization of speech LSP parameters using structured vector quantizers," Proceedings of ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing, 1991.
[3] Levent M. Arslan et al., "Voice conversion by codebook mapping of line spectral frequencies and excitation spectrum," EUROSPEECH, 1997.
[4] Keinosuke Fukunaga et al., "Statistical Pattern Recognition," Handbook of Pattern Recognition and Computer Vision, 1993.
[5] Matthew Stone et al., "Modeling the Interaction between Speech and Gesture," 1994.
[6] Alan McCree et al., "New methods for adaptive noise suppression," 1995 International Conference on Acoustics, Speech, and Signal Processing, 1995.
[7] Jonas Beskow et al., "Animation of talking agents," AVSP, 1997.
[8] "Speech dialogue with facial displays," CHI '94, 1994.
[9] Hani Yehia et al., "Quantitative association of orofacial and vocal-tract shapes," AVSP, 1997.
[10] Lynne E. Bernstein et al., "Effects of phonetic variation and the structure of the lexicon on the uniqueness of words," AVSP, 1997.
[11] Christoph Bregler et al., "Video rewrite: visual speech synthesis from video," AVSP, 1997.