VOICE CONVERSION BY CODEBOOK MAPPING OF LINE SPECTRAL FREQUENCIES AND EXCITATION SPECTRUM

Levent M. Arslan and David Talkin
Entropic Research Laboratory, Washington, DC, 20003

ABSTRACT

This paper presents a new scheme for developing a voice conversion system that modifies the utterance of a source speaker to sound like speech from a target speaker. We refer to the method as the Speaker Transformation Algorithm using Segmental Codebooks (STASC). Two new methods are described to perform the transformation of vocal tract and glottal excitation characteristics across speakers. In addition, the source speaker's general prosodic characteristics are modified using time-scale and pitch-scale modification algorithms. Informal listening tests suggest that convincing voice conversion is achieved while maintaining high speech quality. The performance of the proposed system is also evaluated on a standard Gaussian mixture model based speaker identification system, and the results show that transformed speech is assigned higher likelihood by the target speaker model than by the source speaker model.

1 Introduction

There has been a considerable amount of research effort directed at the problem of voice transformation recently [1, 3, 4, 8]. This topic has numerous applications, including personification of text-to-speech systems, multimedia entertainment, and preprocessing for speech recognition to reduce speaker variability. In general, the approach to this problem consists of a training phase in which speech training data from the source and target speakers are used to formulate a spectral transformation that maps the acoustic space of the source speaker to that of the target speaker. The acoustic space can be characterized with a number of possible acoustic features, which have been studied extensively in the literature. The most popular features used for voice transformation include formant frequencies [1] and LPC cepstrum coefficients [7]. The transformation is in general based on codebook mapping [1, 3, 7]. That is, a one-to-one correspondence between the spectral codebook entries of the source speaker and the target speaker is developed by some form of supervised vector quantization method. In general, these methods face several problems, such as artifacts introduced at the boundaries between successive speech frames, limitations on robust estimation of parameters (e.g., formant frequency estimation), or distortion introduced during synthesis of the target speech. Another issue which has not been explored in detail is the transformation of the glottal excitation characteristics aside from the vocal tract characteristics. Several studies have recently proposed solutions to address this issue [4, 7]. In this study, we propose new and effective solutions to both problems with the goal of maintaining high speech quality.

2 Algorithm Description

This section provides a general description of the STASC algorithm. The training speech (sampled at 16 kHz) from the source and target speakers is first segmented automatically using forced alignment to a phonetic translation of the orthographic transcription. Codebooks of line spectral frequencies (LSF) are used to represent the vocal tract characteristics of the source and target speakers. The reason for selecting line spectral frequencies is that these parameters relate closely to formant frequencies [5], but in contrast to formant frequencies they can be estimated quite reliably. In addition, they have a fixed range, which makes them attractive for real-time DSP implementation. The LSF codebooks are generated as follows: the line spectral frequencies for the source and target speaker utterances are calculated on a frame-by-frame basis, and each LSF vector is labeled using the phonetic segmenter. Next, a centroid LSF vector for each phoneme is estimated for both the source and target speaker codebooks by averaging across all corresponding speech frames. A one-to-one mapping is established between the source and target codebooks to accomplish the voice transformation. The transformation will be explained in detail later in this section.

Another factor that influences speaker individuality is the glottal excitation characteristics. The LPC residual can be a reasonable approximation to the glottal excitation signal. It is well known that the residual can be very different for different phonemes (e.g., a periodic pulse train for voiced sounds versus white noise for unvoiced sounds). Therefore, we formulated a "codebook based" transformation of excitation characteristics similar to the one discussed above for the vocal tract spectrum transformation. Codebooks for excitation characteristics are obtained as follows: using the segmentation information, LPC residual signals for each phoneme in the codebook are collected from the training data. Next, the short-time average magnitude spectrum of the excitation signal is estimated pitch-synchronously for each phoneme for both the source speaker and the target speaker. An excitation transformation filter can then be formulated for each codeword entry using the excitation spectra of the source speaker and the target speaker. This method not only transforms the excitation characteristics, but also estimates a reasonable transformation for the "zeros" in the spectrum, which are not represented accurately by all-pole modeling. As a result, this method improved voice conversion performance, especially for nasalized sounds.

The flow diagram of the STASC voice transformation algorithm is shown in Figure 1. The incoming speech is first sampled at 16 kHz and preemphasized with the filter P(z) = 1 - 0.95 z^(-1). Next, 18th order LPC analysis is performed to estimate the prediction coefficients. Based on these coefficients, an inverse filter, A(z), is formulated as:

    A(z) = 1 - Σ_{k=1}^{P} a_k z^(-k)    (1)

This filter is used to estimate g_s(n), which is an approximation of the excitation signal for the source speaker. Next, line
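The LSF codebook construction described in Section 2 (averaging the per-frame LSF vectors of each phoneme into one centroid codeword per speaker, with the cross-speaker mapping given by shared phoneme identities) can be sketched roughly as follows. This is a minimal numpy sketch, not the paper's implementation: the helper name `build_lsf_codebooks`, the toy phoneme labels, and the random 10-dimensional "LSF" vectors are all hypothetical stand-ins for real forced-alignment labels and real LSF analysis.

```python
import numpy as np

def build_lsf_codebooks(lsf_frames, labels, phonemes):
    """Hypothetical helper: one centroid LSF codeword per phoneme,
    computed by averaging all frames carrying that phoneme label."""
    codebook = {}
    for ph in phonemes:
        frames = lsf_frames[labels == ph]   # all frames aligned to phoneme ph
        codebook[ph] = frames.mean(axis=0)  # centroid LSF vector
    return codebook

# Toy data: ten 10-dimensional "LSF" vectors for two phonemes.
rng = np.random.default_rng(0)
labels = np.array(["aa"] * 5 + ["iy"] * 5)
lsf = rng.uniform(0.0, np.pi, size=(10, 10))

src_cb = build_lsf_codebooks(lsf, labels, ["aa", "iy"])
# Built from each speaker's own training data, the codebooks share phoneme
# keys, so src_cb["aa"] <-> tgt_cb["aa"] gives the one-to-one mapping.
```

Because both codebooks are indexed by the same phoneme inventory, no unsupervised pairing of codewords across speakers is needed; the phonetic segmentation supplies the correspondence directly.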
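The per-phoneme excitation transformation filter can be read as a magnitude-spectrum ratio: the target speaker's average residual magnitude spectrum divided by the source speaker's. A minimal sketch under that reading follows; the function name, FFT size, and the impulse-train toy data are illustrative assumptions, not the paper's code, and a real system would collect pitch-synchronous residual frames per phoneme.

```python
import numpy as np

def excitation_filter(src_residuals, tgt_residuals, n_fft=256, eps=1e-8):
    """Per-codeword excitation transformation filter, sketched as the ratio
    of target and source average residual magnitude spectra."""
    src_mag = np.mean([np.abs(np.fft.rfft(r, n_fft)) for r in src_residuals], axis=0)
    tgt_mag = np.mean([np.abs(np.fft.rfft(r, n_fft)) for r in tgt_residuals], axis=0)
    return tgt_mag / (src_mag + eps)    # zero-phase magnitude correction

# Toy check: a unit impulse has a flat magnitude spectrum, so doubling the
# target residual should yield a filter of (approximately) 2 at every bin.
src = [np.r_[1.0, np.zeros(63)]]
tgt = [2.0 * src[0]]
filt = excitation_filter(src, tgt, n_fft=64)
```

Applying this filter to the magnitude spectrum of the source residual before synthesis reshapes spectral valleys too, which is why such a ratio can compensate for the "zeros" that an all-pole model misses.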
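The analysis front end of the last paragraph (preemphasis with P(z) = 1 - 0.95 z^(-1), LPC analysis, then inverse filtering by A(z) to obtain the residual g_s(n)) can be sketched as below. This is an assumption-laden illustration: it uses the standard autocorrelation method with the Levinson-Durbin recursion, an order-1 toy signal instead of the paper's 18th order analysis of real speech, and numpy's sign convention in which the returned coefficients already form the prediction-error filter of Eq. (1).

```python
import numpy as np

def preemphasize(x, alpha=0.95):
    """Apply the preemphasis filter P(z) = 1 - alpha * z^-1."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def lpc(x, order):
    """Autocorrelation-method LPC via the Levinson-Durbin recursion.
    Returns a with a[0] = 1, so a holds the coefficients of the inverse
    (prediction-error) filter; a[k] = -a_k in the notation of Eq. (1)."""
    r = np.correlate(x, x, mode="full")[len(x) - 1 : len(x) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1 : 0 : -1])
        k = -acc / err                       # reflection coefficient
        prev = a[: i + 1].copy()
        a[1 : i + 1] = prev[1 : i + 1] + k * prev[i - 1 :: -1]
        err *= 1.0 - k * k                   # updated prediction error
    return a

# Toy AR(1) signal x[n] = 0.9 x[n-1] + e[n]; order-1 LPC should recover
# a filter close to 1 - 0.9 z^-1, and inverse filtering should return e.
rng = np.random.default_rng(1)
e = rng.standard_normal(4000)
x = np.zeros_like(e)
for n in range(1, len(x)):
    x[n] = 0.9 * x[n - 1] + e[n]

a = lpc(x, order=1)                 # a[1] should be close to -0.9
g = np.convolve(x, a)[: len(x)]     # residual: the excitation estimate
```

On real speech the same flow would be run frame by frame on the preemphasized signal (`preemphasize(x)`) with `order=18`, yielding the excitation approximation g_s(n) used in the rest of the algorithm.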
References

[1] Yannis Stylianou, et al., "On the transformation of the speech spectrum for voice conversion," Proc. Fourth International Conference on Spoken Language Processing (ICSLP '96), 1996.
[2] Alan McCree, et al., "New methods for adaptive noise suppression," Proc. 1995 International Conference on Acoustics, Speech, and Signal Processing, 1995.
[3] Satoshi Nakamura, et al., "Voice conversion through vector quantization," Proc. ICASSP-88, International Conference on Acoustics, Speech, and Signal Processing, 1988.
[4] Joel R. Crosmer, et al., "Very low bit rate speech coding using the line spectrum pair transformation of the LPC coefficients," 1985.
[5] Dae Hee Youn, et al., "A new voice transformation method based on both linear and nonlinear prediction analysis," Proc. Fourth International Conference on Spoken Language Processing (ICSLP '96), 1996.
[6] Rajiv Laroia, et al., "Robust and efficient quantization of speech LSP parameters using structured vector quantizers," Proc. ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing, 1991.