Voice conversion by codebook mapping of line spectral frequencies and excitation spectrum

Levent M. Arslan and David Talkin
Entropic Research Laboratory, Washington, DC, 20003

ABSTRACT

This paper presents a new scheme for developing a voice conversion system that modifies the utterance of a source speaker to sound like speech from a target speaker. We refer to the method as Speaker Transformation Algorithm using Segmental Codebooks (STASC). Two new methods are described to perform the transformation of vocal tract and glottal excitation characteristics across speakers. In addition, the source speaker's general prosodic characteristics are modified using time-scale and pitch-scale modification algorithms. Informal listening tests suggest that convincing voice conversion is achieved while maintaining high speech quality. The performance of the proposed system is also evaluated on a standard Gaussian mixture model based speaker identification system, and the results show that the transformed speech is assigned higher likelihood by the target speaker model when compared to the source speaker model.

1 Introduction

There has been a considerable amount of research effort directed at the problem of voice transformation recently [1, 3, 4, 8]. This topic has numerous applications, which include personification of text-to-speech systems, multimedia entertainment, and use as a preprocessing step to speech recognition to reduce speaker variability. In general, the approach to the problem consists of a training phase where input speech training data from source and target speakers are used to formulate a spectral transformation that would map the acoustic space of the source speaker to that of the target speaker. The acoustic space can be characterized with a number of possible acoustic features which have been studied extensively in the literature. The most popular features used for voice transformation include formant frequencies [1] and LPC cepstrum coefficients [7]. The transformation is in general based on codebook mapping [1, 3, 7]. That is, a one-to-one correspondence between the spectral codebook entries of the source speaker and the target speaker is developed by some form of supervised vector quantization method. In general, these methods face several problems such as artifacts introduced at the boundaries between successive speech frames, limitation on robust estimation of parameters (e.g., formant frequency estimation), or distortion introduced during synthesis of target speech. Another issue which has not been explored in detail is the transformation of the glottal excitation characteristics aside from the vocal tract characteristics. Several studies proposed solutions to address this issue recently [4, 7]. In this study, we propose new and effective solutions to both problems with the goal of maintaining high speech quality.

2 Algorithm Description

This section provides a general description of the STASC algorithm. The training speech (sampled at 16 kHz) from source and target speakers is first segmented automatically using forced alignment to a phonetic translation of the orthographic transcription. Codebooks of line spectral frequencies (LSF) are used in order to represent the spectral characteristics of the source and target speaker vocal tracts. The reason for selecting line spectral frequencies is that these parameters relate closely to formant frequencies [5], but in contrast to formant frequencies they can be estimated quite reliably. In addition, they have a fixed range, which makes them attractive for real-time DSP implementation. The LSF codebooks are generated as follows: The line spectral frequencies for source and target speaker utterances are calculated on a frame-by-frame basis and each LSF vector is labeled using the phonetic segmenter. Next, a centroid LSF vector for each phoneme is estimated for both source and target speaker codebooks by averaging across all corresponding speech frames. A one-to-one mapping is established from the source and target codebooks to accomplish the voice transformation. The transformation will be explained in detail later in this section.

Another factor that influences speaker individuality is glottal excitation characteristics. The LPC residual can be a reasonable approximation to the glottal excitation signal. It is well known that the residual can be very different for different phonemes (e.g., periodic pulse train for voiced sounds versus white noise for unvoiced sounds). Therefore, we formulated a "codebook based" transformation of the excitation characteristics similar to the one discussed above for vocal tract spectrum transformation. Codebooks for excitation characteristics are obtained as follows: Using the segmentation information, the LPC residual signals for each phoneme in the codebook are collected from the training data. Next, a short-time average magnitude spectrum of the excitation signal is estimated for each phoneme, both for the source speaker and the target speaker, pitch synchronously. An excitation transformation filter can be formulated for each codeword entry using the excitation spectra of the source speaker and the target speaker. This method not only transforms the excitation characteristics, but it estimates a reasonable transformation for the "zeros" in the spectrum as well, which are not represented accurately by all-pole modeling. Therefore, this method resulted in improved voice conversion performance, especially for nasalized sounds.

The flow diagram of the STASC voice transformation algorithm is shown in Figure 1. The incoming speech is first sampled at 16 kHz and preemphasized with the filter $P(z) = 1 - 0.95 z^{-1}$. Next, 18th order LPC analysis is performed to estimate the prediction coefficients. Based on these coefficients, an inverse filter, $A(z)$, is formulated as:

$$A(z) = 1 - \sum_{k=1}^{P} a_k z^{-k} \qquad (1)$$

This filter is used to estimate $g_s(n)$, which is an approximation of the excitation signal for the source speaker. Next, line
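The pre-emphasis, LPC analysis, and inverse filtering steps described in Section 2 can be illustrated with a minimal sketch. It assumes the autocorrelation method with the Levinson-Durbin recursion, which the paper does not specify, and it uses a synthetic 25 ms test frame without windowing. All function names and the test signal are hypothetical and chosen only for illustration.

```python
import math
import random


def preemphasize(x, alpha=0.95):
    """Apply the pre-emphasis filter P(z) = 1 - alpha * z^-1."""
    return [x[n] - alpha * (x[n - 1] if n > 0 else 0.0) for n in range(len(x))]


def autocorrelation(x, max_lag):
    """Return the autocorrelation values r[0..max_lag] of the frame x."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve for predictor coefficients a_1..a_P with x[n] ~ sum_k a_k x[n-k].

    Assumption: autocorrelation-method LPC; the paper only states that
    18th order LPC analysis is performed.
    """
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # degenerate (e.g. all-zero) frame: stop early
            break
        acc = r[i] - sum(a[k] * r[i - k] for k in range(1, i))
        refl = acc / err
        updated = a[:]
        updated[i] = refl
        for k in range(1, i):
            updated[k] = a[k] - refl * a[i - k]
        a = updated
        err *= 1.0 - refl * refl
    return a[1:]


def inverse_filter(x, a):
    """Apply A(z) = 1 - sum_k a_k z^-k; the output is the LPC residual."""
    residual = []
    for n in range(len(x)):
        predicted = sum(a[k] * x[n - 1 - k]
                        for k in range(len(a)) if n - 1 - k >= 0)
        residual.append(x[n] - predicted)
    return residual


# Synthetic frame (hypothetical test signal): two sinusoids plus low-level
# noise, 400 samples = 25 ms at the 16 kHz sampling rate used in the paper.
rng = random.Random(0)
frame = [math.sin(2 * math.pi * 200 * n / 16000)
         + 0.5 * math.sin(2 * math.pi * 1200 * n / 16000)
         + 0.01 * rng.gauss(0.0, 1.0)
         for n in range(400)]

emphasized = preemphasize(frame)
coeffs = levinson_durbin(autocorrelation(emphasized, 18), 18)
residual = inverse_filter(emphasized, coeffs)
```

Here `residual` plays the role of the excitation approximation obtained by passing the pre-emphasized frame through $A(z)$. A practical system would typically window each frame before the analysis; that detail is omitted because the text does not describe it.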

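The LSF codebook construction described in Section 2 (one centroid per phoneme, obtained by averaging the labeled frames) can also be sketched briefly. This is an illustration under stated assumptions: the input format of `(phoneme, lsf_vector)` pairs, the function names, and the toy two-dimensional vectors are all hypothetical, and the pairing by shared phoneme label is only the simplest reading of the one-to-one mapping, not the full transformation the paper explains later.

```python
def build_codebook(labeled_frames):
    """Average the LSF vectors of all frames that share a phoneme label.

    labeled_frames: iterable of (phoneme, lsf_vector) pairs, as would be
    produced by frame-wise LSF analysis plus phonetic segmentation.
    Returns a dict mapping phoneme -> centroid LSF vector.
    """
    sums = {}
    counts = {}
    for phoneme, lsf in labeled_frames:
        if phoneme not in sums:
            sums[phoneme] = [0.0] * len(lsf)
            counts[phoneme] = 0
        for i, value in enumerate(lsf):
            sums[phoneme][i] += value
        counts[phoneme] += 1
    return {p: [s / counts[p] for s in sums[p]] for p in sums}


def pair_codebooks(source_codebook, target_codebook):
    """Pair source and target centroids that carry the same phoneme label."""
    return {p: (source_codebook[p], target_codebook[p])
            for p in sorted(source_codebook) if p in target_codebook}


# Toy data with 2-dimensional "LSF" vectors (hypothetical values, in radians).
source_frames = [("aa", [0.2, 0.6]), ("aa", [0.4, 0.8]), ("s", [1.0, 2.0])]
target_frames = [("aa", [0.3, 0.9]), ("s", [1.2, 2.2]), ("s", [1.4, 2.4])]

source_codebook = build_codebook(source_frames)
target_codebook = build_codebook(target_frames)
mapping = pair_codebooks(source_codebook, target_codebook)
```

Keying both codebooks by phoneme label makes the correspondence between entries explicit, so each source centroid has exactly one target counterpart.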
[1]  Yannis Stylianou,et al.  On the transformation of the speech spectrum for voice conversion , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[2]  Alan McCree,et al.  New methods for adaptive noise suppression , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[3]  Satoshi Nakamura,et al.  Voice conversion through vector quantization , 1988, ICASSP-88., International Conference on Acoustics, Speech, and Signal Processing.

[4]  Joel R. Crosmer,et al.  Very low bit rate speech coding using the line spectrum pair transformation of the LPC coefficients , 1985 .

[5]  Dae Hee Youn,et al.  A new voice transformation method based on both linear and nonlinear prediction analysis , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[6]  Rajiv Laroia,et al.  Robust and efficient quantization of speech LSP parameters using structured vector quantizers , 1991, [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing.