Paper information - Speech coding using mixture of Gaussians polynomial model

Speech coding using mixture of Gaussians polynomial model

SPEECH CODING USING MIXTURE OF GAUSSIANS POLYNOMIAL MODEL

Parham Zolfaghari, †Tony Robinson

CREST/ATR Human Information Processing Research Labs, Kyoto 619-02, Japan
email: zparham@hip.atr.co.jp
†Cambridge University Engineering Department, Cambridge CB2 1PZ, UK
email: ajr@eng.cam.ac.uk

ABSTRACT

We have investigated a novel method of spectral estimation based on mixture of Gaussians in a sinusoidal analysis and synthesis framework. After quantisation of this parametric scheme a fixed frame-rate coder operating at a bit-rate of around 2.4 kbits/s has been developed. This paper describes an extension to this spectral model based on constraining the parameters of the mixture of Gaussians to be on a polynomial trajectory over a segment of speech data. This is referred to as the mixture of Gaussians polynomial model (MGPM). In order to realise a segmental coder, dynamic programming over the utterance is performed. The segmental representation of the spectra results in a log-likelihood score over a segment which is used as the cost function in the dynamic programming algorithm. Speech coding components such as pitch, voicing and gain are described segmentally. A number of segmental coders are presented with bit-rates in the range of 350 to 650 bits/s. These coders offer good and intelligible coded speech evaluated using DRT scoring at these bit-rates.

1. INTRODUCTION

A segmental framework employs the inter-frame or time dependence of the spectral representation. This dependence is inherent in various segments of speech, such as sustained vowels, as the speech spectral envelope is a slow time-varying process and spectra of adjacent frames are highly correlated. Various forms of segmentation models have been applied to speech coding and recognition. In coding, Roucos et al. [11] describe a very low bit-rate segment vocoder operating at 150 bits/s for a single speaker. This low rate is achieved by vector quantisation (VQ) of all the LPC spectra in a segment as a single unit. The Kang-Coulter 600 bits/s vocoder [6] also uses LPC methods followed by formant tracking to produce good quality speech with a reported DRT score of 79.9. These low bit-rates can also be achieved by a recognition-based approach where recognition units are coded. Holmes [5] has described a method which uses an underlying linear-trajectory formant model for both recognition and synthesis.

The contribution of this work is to model the envelope of the short-term power spectral density as a mixture of Gaussians [13]. In this framework a Gaussian roughly corresponds to a formant with the Gaussian mean corresponding to the formant frequency and the variance corresponding to the bandwidth. This model was integrated in a sinusoidal model based speech coding scheme [14]. An advantage of this framework is that a speech segment may be modelled using a polynomial trajectory for the Gaussian means and variances. We have previously reported on a segmental coder using a linear polynomial trajectory for the Gaussian mixtures operating between 600-800 bits/s [15]. We extend this model to an R'th order polynomial to represent both means and variances of the Gaussians. In the speech recognition area, similar models have also been implemented for MFCC trajectories in a HMM-based system [4].

2. SEGMENTAL CODER STRUCTURE

The block structure of the coders described in this paper is as shown in Figure 1. A sinusoidal model framework based on the ideas of McAulay and Quatieri [8] is used. In this model, the speech signal is represented by a harmonic set of partials with varying amplitudes and frequencies. In accordance with our desire to build a very low bit-rate coder we restrict the sine waves to be harmonically related. The inverse FFT method of re-synthesis [3] is used and the phase of each harmonic is chosen at reconstruction time to minimise the mismatch with the previous frame.

The Spectral Envelope Estimation Vocoder (SEEVOC) envelope, devised by Paul [9], uses a robust peak detection algorithm to yield a smooth envelope as the underlying spectral representation. In order to operate in the low bit-rate region, the SEEVOC envelope needs to be efficiently coded. We apply the mixture of Gaussians polynomial model to represent this spectra over a segment. Polynomial least squares
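
The idea of a mixture of Gaussians whose means and variances follow a polynomial trajectory over a segment can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`poly_eval`, `mgpm_envelope`), the normalisation of time to the range 0 to 1 within a segment, the use of fixed mixture weights, and all numeric values are hypothetical choices that the text above does not specify.

```python
import math


def poly_eval(coeffs, t):
    """Evaluate sum_r coeffs[r] * t**r using Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * t + c
    return result


def mgpm_envelope(freqs_hz, t, weights, mean_coeffs, var_coeffs):
    """Mixture-of-Gaussians spectral envelope at segment time t.

    Assumptions (not taken from the paper): t is normalised to [0, 1],
    the mixture weights are constant over the segment, and
    mean_coeffs[m] / var_coeffs[m] hold the polynomial coefficients
    (lowest order first) of the m-th Gaussian's mean in Hz and
    variance in Hz^2.
    """
    envelope = []
    for f in freqs_hz:
        total = 0.0
        for w, mc, vc in zip(weights, mean_coeffs, var_coeffs):
            mean = poly_eval(mc, t)
            var = poly_eval(vc, t)
            if var <= 0.0:
                raise ValueError("variance trajectory must stay positive")
            norm = math.sqrt(2.0 * math.pi * var)
            total += w * math.exp(-0.5 * (f - mean) ** 2 / var) / norm
        envelope.append(total)
    return envelope


# Hypothetical two-Gaussian segment: the first "formant" glides linearly
# from 500 Hz to 700 Hz, the second stays at 1500 Hz.
freqs = [50.0 * k for k in range(80)]            # 0 Hz .. 3950 Hz
weights = [0.6, 0.4]
mean_coeffs = [[500.0, 200.0], [1500.0, 0.0]]    # first-order polynomials
var_coeffs = [[100.0 ** 2, 0.0], [150.0 ** 2, 0.0]]

env_start = mgpm_envelope(freqs, 0.0, weights, mean_coeffs, var_coeffs)
env_end = mgpm_envelope(freqs, 1.0, weights, mean_coeffs, var_coeffs)

# Frequency of the strongest envelope peak at the start and end of the segment.
peak_start = freqs[max(range(len(freqs)), key=lambda i: env_start[i])]
peak_end = freqs[max(range(len(freqs)), key=lambda i: env_end[i])]
```

In this sketch a whole segment is described by a handful of polynomial coefficients per Gaussian rather than by one parameter set per frame, which is the property that makes a segmental representation attractive at very low bit-rates. Higher-order trajectories only require longer coefficient lists.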

Parham Zolfaghari | Tony Robinson

[1] Kuldip K. Paliwal, et al. Model parameter estimation for mixture density polynomial segment models, 1998, Comput. Speech Lang.

[2] Richard M. Schwartz, et al. A segment vocoder at 150 b/s, 1983, ICASSP.

[3] Parham Zolfaghari, et al. A formant vocoder based on mixtures of Gaussians, 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[4] Thomas F. Quatieri, et al. Magnitude-only reconstruction using a sinusoidal speech model, 1984, ICASSP.

[5] George S. Kang, et al. 600-Bit-Per-Second Voice Digitizer (Linear Predictive Formant Vocoder), 1976.

[6] Heekuck Oh, et al. Neural Networks for Pattern Recognition, 1993, Adv. Comput.

[7] Parham Zolfaghari, et al. A segmental formant vocoder based on linearly varying mixture of Gaussians, 1997, EUROSPEECH.

[8] D. Rubin, et al. Maximum likelihood from incomplete data via the EM algorithm plus discussions on the paper, 1977.

[9] D. Paul. The spectral envelope estimation vocoder, 1981.

[10] Parham Zolfaghari, et al. Formant analysis using mixtures of Gaussians, 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[11] Steve Renals, et al. WSJCAM0: a British English speech corpus for large vocabulary continuous speech recognition, 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.